simon
2016-04-08 13:41
has joined #fabric-consensus-dev

jyellick
2016-04-08 13:41
has joined #fabric-consensus-dev

tuand
2016-04-08 13:41
has joined #fabric-consensus-dev

kostas
2016-04-08 13:41
has joined #fabric-consensus-dev

vukolic
2016-04-08 13:41
has joined #fabric-consensus-dev

cca
2016-04-08 13:41
has joined #fabric-consensus-dev

simon
2016-04-08 13:42
# seems to be popular with a lot of people, so i think having an explicit fabric dev channel is better

simon
2016-04-08 13:43
I'm working on #1000 - but i'm not quite sure which data needs to be persisted

simon
2016-04-08 13:43
the checkpoints for sure, but what about P and Q sets?

simon
2016-04-08 13:44
if I persist the P and Q sets, then i need to persist the requests as well

simon
2016-04-08 13:45
but I think ultimately we need to define in #351 which kind of crash fault tolerance we want

simon
2016-04-08 13:46
it would be great if we could discuss and define what kind of crashes we want to be able to tolerate in what way?

simon
2016-04-08 13:49
for example - if >f replicas crash around the same time (and there are f more byzantine replicas), how should the network deal with this?

simon
2016-04-08 13:50
i guess worst case the primary cannot get requests from proposed P and Q sets, and on view change, these proposed requests disappear?

simon
2016-04-08 13:50
in that case we only need to persist checkpoints

simon
2016-04-08 13:51
and possibly the last executed seqno

simon
2016-04-08 14:17
thinking a bit more about it, I don't think we can ever tolerate more than f crashes

simon
2016-04-08 14:17
without giving up correctness guarantees

simon
2016-04-08 14:17
or persisting all in-memory state, such as outstanding requests

simon
2016-04-08 14:19
i think realistically we need to get away from the plain PBFT, and adapt ours to blockchain

simon
2016-04-08 14:20
i.e. treat every block on the chain as a checkpoint, and have a way to retrieve committed requests back from the blockchain

simon
2016-04-08 14:20
then we still need to persist the Pset, but that's more manageable

simon
2016-04-08 14:21
however, getting committed requests from the blockchain means that my complaints/dedup work won't work, because these requests are ephemeral, and only transactions are committed, not requests (which carry a timestamp and signature)

simon
2016-04-08 14:22
it is increasingly apparent to me that we cannot develop a silo solution, but we need to use a more holistic design

manish-sethi
2016-04-08 15:07
has joined #fabric-consensus-dev

rgupta1
2016-04-08 15:12
has joined #fabric-consensus-dev

toddsjsimmer
2016-04-08 16:57
has joined #fabric-consensus-dev

novusopt
2016-04-08 17:06
has joined #fabric-consensus-dev

keoja
2016-04-08 17:09
has joined #fabric-consensus-dev

noam
2016-04-10 10:19
has joined #fabric-consensus-dev

paulojrmoreira
2016-04-10 21:53
has joined #fabric-consensus-dev

tim.blankers
2016-04-11 07:12
has joined #fabric-consensus-dev

richernandez2
2016-04-11 14:13
has joined #fabric-consensus-dev

simon
2016-04-11 14:41
so first I thought I'd persist all messages - but then garbage collecting them becomes difficult

simon
2016-04-11 14:42
now i'm back to persisting `pset, qset, reqStore, certStore, checkpointStore, lastExec`

simon
2016-04-11 14:43
i wonder what i really need to persist there

simon
2016-04-11 14:44
maybe i don't need the certstore - that would make things easier

simon
2016-04-11 14:44
@jyellick: what do you think?

jyellick
2016-04-11 14:47
@simon: I would need to do an audit, but that seems like the minimal set of what would be required. I assume the crash-recovery scenario is essentially to conduct a view change, so that everyone has a consistent view of the execution, which would trigger state transfer (as the viewchange already does today)

jyellick
2016-04-11 14:48
What is the strategy for persisting? Would this go into the system state per the pending system chaincode mechanism?

simon
2016-04-11 14:48
right, if fewer than F out of N replicas are byzantine or crashed, then everything proceeds as normal

simon
2016-04-11 14:49
if more than F are crashed, then it should look like they just were disconnected from the network

jyellick
2016-04-11 14:51
I'm not sure I understand that second sentence, could you elaborate?

simon
2016-04-11 14:51
i don't know whether in all cases there would be a view change

simon
2016-04-11 14:52
why can't i find details on this in the pbft paper?

jyellick
2016-04-11 14:55
Yes, it does seem like a significant oversight if it is not in there

jyellick
2016-04-11 14:58
@simon: Why did you say that garbage collection became difficult? I would have thought you could have simply put all messages into a log, then garbage collected all messages with sequence number < X after a stable checkpoint of sequence number X?

simon
2016-04-11 14:58
not all messages contain sequence numbers

jyellick
2016-04-11 14:59
Ah, for requests and potentially Sieve messages?

simon
2016-04-11 14:59
for example

simon
2016-04-11 15:00
also replaying the messages is a bit of a pain, I think

jyellick
2016-04-11 15:01
I guess I would have thought replaying messages would be very simple. Isn't this also the standard crash fault recovery mechanism for most systems?

jyellick
2016-04-11 15:01
(Filesystem journals, db logs, etc.)

simon
2016-04-11 15:01
yea

simon
2016-04-11 15:01
we'll have to change the code that applies messages

simon
2016-04-11 15:02
and remove the conditionals that reject messages
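
(Editor's note: the log-then-garbage-collect idea above could look roughly like this toy sketch, hypothetical types only. It only covers sequence-numbered messages; as noted above, requests and some Sieve messages carry no sequence number and would need separate handling.)

```go
package main

import "fmt"

// journalEntry tags each logged consensus message with its sequence
// number so it can be garbage collected at a stable checkpoint.
type journalEntry struct {
	SeqNo uint64
	Msg   string
}

type journal struct {
	entries []journalEntry
}

func (j *journal) append(seqNo uint64, msg string) {
	j.entries = append(j.entries, journalEntry{SeqNo: seqNo, Msg: msg})
}

// gc drops every entry below the stable checkpoint's sequence number:
// after a stable checkpoint at X, messages with seqNo < X are never
// needed again for recovery.
func (j *journal) gc(stableSeqNo uint64) {
	kept := j.entries[:0]
	for _, e := range j.entries {
		if e.SeqNo >= stableSeqNo {
			kept = append(kept, e)
		}
	}
	j.entries = kept
}

func main() {
	j := &journal{}
	j.append(1, "prepare")
	j.append(2, "commit")
	j.append(3, "prepare")
	j.gc(3) // stable checkpoint at seqNo 3
	fmt.Println(len(j.entries))
}
```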

chetsky
2016-04-11 16:58
has joined #fabric-consensus-dev

jatinderbali
2016-04-11 17:12
has joined #fabric-consensus-dev

simon
2016-04-11 17:19
@jyellick: would you make restarting replicas require a view change to join completely?

jyellick
2016-04-11 17:26
@simon: That's a good question, I would say in the case where more than f replicas have gone offline, that it would be a reasonable thing to require.

simon
2016-04-11 17:26
sorry, these questions are all over the place

jyellick
2016-04-11 17:27
Although replaying the journal/log might get things into a consistent enough state to continue, if the primary has missed some of the network chatter, then a view change will need to happen anyway.

simon
2016-04-11 17:27
how does the executor queueing execs influence persistence? we need to atomically track "lastExec"

jyellick
2016-04-11 17:30
So, I think the executor definitely needs a journal, which should be pretty easy to implement, as it is literally a linear stream of monotonically increasing sequence-numbered requests.

jyellick
2016-04-11 17:30
I'm not sure about the atomicity, where we're trying to treat the executor as a remote component (as it eventually will be)

simon
2016-04-11 17:31
it has to be atomic

simon
2016-04-11 17:31
because otherwise, how would pbft ever know which request to execute

simon
2016-04-11 17:31
when we come back up, our notion of lastExec needs to match the system state

jyellick
2016-04-11 17:32
Aha, so this is the code that I've stubbed out but is presently unimplemented

jyellick
2016-04-11 17:32
The executor needs to inform the orderer of the execution state on connection

simon
2016-04-11 17:32
now you could say "just roll back to the last checkpoint", which can be a locally correct decision

simon
2016-04-11 17:32
but overall, that way the network could lose already executed requests

simon
2016-04-11 17:33
now, if we treat execution of blocks as the equivalent of a checkpoint...

jyellick
2016-04-11 17:35
The executor can be expected to inform the orderer of the last sequence number it was told to execute. This would be lastExec for classic and batch pbft. For Sieve there would need to be some mapping done

simon
2016-04-11 17:36
right

simon
2016-04-11 17:36
so that is the startupinfo

jyellick
2016-04-11 17:36
Right

jyellick
2016-04-11 17:37
Today it always comes back as 0, because we had no way to persist the state, but after implementing the rest of the crash tolerance (ie a journal) that should be easy.
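
(Editor's note: a minimal sketch of that executor journal idea, hypothetical names, no real persistence: a linear log of executed sequence numbers whose tail is reported back to the orderer as lastExec on connection; an empty journal reports 0, matching the "comes back as 0" behavior above.)

```go
package main

import "fmt"

// executorJournal is a hypothetical linear log of executed requests,
// keyed by monotonically increasing sequence number.
type executorJournal struct {
	executed []uint64
}

// execute appends the next sequence number; in a real implementation
// the append and the execution itself would have to be made atomic.
func (e *executorJournal) execute(seqNo uint64) error {
	if n := len(e.executed); n > 0 && seqNo != e.executed[n-1]+1 {
		return fmt.Errorf("non-contiguous seqNo %d", seqNo)
	}
	e.executed = append(e.executed, seqNo)
	return nil
}

// startupInfo reports lastExec back to the orderer on (re)connection;
// 0 means nothing has been executed yet.
func (e *executorJournal) startupInfo() uint64 {
	if len(e.executed) == 0 {
		return 0
	}
	return e.executed[len(e.executed)-1]
}

func main() {
	j := &executorJournal{}
	for s := uint64(1); s <= 3; s++ {
		if err := j.execute(s); err != nil {
			panic(err)
		}
	}
	fmt.Println(j.startupInfo())
}
```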

simon
2016-04-11 17:37
i'm wondering whether we should adapt pbft to blockchain, i.e. treat committed blocks as checkpoints

simon
2016-04-11 17:37
right, i'll work on that part

simon
2016-04-11 17:38
i think it is sufficient to persist reqstore, pset*, qset*, lastExec

simon
2016-04-11 17:39
where pset* is pset updated with certstore

simon
2016-04-11 17:39
ah, plus commits

simon
2016-04-11 17:39
commit certificate

jyellick
2016-04-11 17:39
So are you suggesting setting K=1?

simon
2016-04-11 17:39
not quite

simon
2016-04-11 17:40
that would trigger checkpoint messages all the time

jyellick
2016-04-11 17:40
Right

simon
2016-04-11 17:40
instead, adapt the protocol to not use checkpoint messages at all

simon
2016-04-11 17:40
that ties into the question i posed in #

simon
2016-04-11 17:41
because if the blockchain contains a signed quorum from validators, that would allow us to stop using checkpoints

simon
2016-04-11 17:41
but that's sort of for later

jyellick
2016-04-11 17:42
So, checkpoints today buy us garbage collection, a basis for new views, and a cue for state transfer

jyellick
2016-04-11 17:43
Also, in the case of non-determinism under non-Sieve pbft, it causes the network to stop progressing

jyellick
2016-04-11 17:44
As we move to the new model, non-determinism shouldn't be an issue, and state transfer could be accomplished with the signed blocks

simon
2016-04-11 17:44
right

jyellick
2016-04-11 17:44
But garbage collection and new view would still need to be addressed

simon
2016-04-11 17:45
new view can use these K=1 style "checkpoint" blocks

simon
2016-04-11 17:45
i think

simon
2016-04-11 17:45
garbage collection as well

simon
2016-04-11 17:45
i.e. we'd still have to communicate "I received this, executed it, signed it"

jyellick
2016-04-11 17:46
Then we advance watermarks based on signed commits?

jyellick
2016-04-11 17:46
(which seems awfully similar to a checkpoint message)

simon
2016-04-11 17:46
yea it does

simon
2016-04-11 17:46
well, let's not spend too much time on it right now

simon
2016-04-11 17:46
just thinking aloud

jyellick
2016-04-11 17:46
Fair enough

simon
2016-04-11 17:47
do you tend to push your intermediate commits to your github repo?

simon
2016-04-11 17:47
for the reactor pattern refactor

jyellick
2016-04-11 17:49
Yes, I like to push my intermediate commits, as it makes my life a little easier, though I know it can make bisecting less useful. Had not heard an official stance on that.

jyellick
2016-04-11 17:50
(But I actually haven't committed anything for 919 yet)

simon
2016-04-11 17:53
ah!

simon
2016-04-11 17:53
okay

simon
2016-04-11 17:53
i'm all for frequent commits

simon
2016-04-11 17:53
i was just wondering, because your repo didn't update here - maybe you were on a different repo

jyellick
2016-04-11 17:54
Ah, yeah, no, I've done some work locally, but wasn't really happy with the structure, so went to Jeff, which triggered that whole goose chase that I mentioned via scrum

simon
2016-04-11 17:57
:simple_smile:

simon
2016-04-11 17:57
bbl dinner

jyellick
2016-04-11 17:58
Sounds good, enjoy

simon
2016-04-11 18:26
i'm wondering how to phrase a test for this persistence thing

jyellick
2016-04-11 18:34
Scenario?

simon
2016-04-11 18:36
yea, what scenario. i guess have most replicas disappear, and restart them

jyellick
2016-04-11 18:38
I think it is important that this happen while transactions are being processed. An idle network should halt and resume normally today.

jyellick
2016-04-11 18:39
Well, actually, I take that back, that is only true if all replicas go down

jyellick
2016-04-11 18:40
Process K+1 transactions, shut f+1 peers down, then start them up again. Execute 1 additional transaction, verify everyone has executed K+2?

simon
2016-04-11 18:46
i guess that works for a start
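
(Editor's note: the scenario just proposed could be phrased against a toy in-memory network like this, not the real behave/obcpbft harness; "state transfer" here is just copying a counter. N, f, K and the replica model are all stand-ins.)

```go
package main

import "fmt"

// toy replica: just tracks how many transactions it has executed.
type replica struct {
	up       bool
	executed int
}

// broadcast delivers n transactions to every live replica.
func broadcast(net []*replica, n int) {
	for i := 0; i < n; i++ {
		for _, r := range net {
			if r.up {
				r.executed++
			}
		}
	}
}

// restart brings a replica back up and catches it up from a live
// peer, a simplistic stand-in for state transfer.
func restart(r *replica, net []*replica) {
	r.up = true
	for _, peer := range net {
		if peer.executed > r.executed {
			r.executed = peer.executed
		}
	}
}

func main() {
	const N, f, K = 4, 1, 2
	net := make([]*replica, N)
	for i := range net {
		net[i] = &replica{up: true}
	}

	broadcast(net, K+1) // process K+1 transactions
	for _, r := range net[:f+1] {
		r.up = false // shut f+1 peers down
	}
	for _, r := range net[:f+1] {
		restart(r, net) // start them up again
	}
	broadcast(net, 1) // execute 1 additional transaction

	for i, r := range net {
		fmt.Printf("vp%d executed %d\n", i, r.executed) // verify everyone is at K+2
	}
}
```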

simon
2016-04-11 18:46
i was thinking of "what if you send a commit and never execute", etc.

jyellick
2016-04-11 18:48
Since you mentioned 'restarting' them, I figured this was behave. I would think those more interesting failure scenarios will probably need to be simulated in unit tests?

simon
2016-04-11 18:48
yea, unit test

simon
2016-04-11 18:49
stop a pbft instance, replace it with a new one, sharing the same ledger

jyellick
2016-04-11 18:52
Ah, I see, yes. I find the unit tests to be much more useful generally anyhow, behave adds so much time in an iterative code/test cycle

chetsky
2016-04-11 20:25
jason? if unbusy&workin'?

jyellick
2016-04-11 20:34
@chetsky: What can I do for you?

chetsky
2016-04-11 20:35
wanted to check on something. Today, there's support for state-xfer, right?

jyellick
2016-04-11 20:35
Yes, there is

chetsky
2016-04-11 20:35
[this is in support of quorum-whitelist change]

chetsky
2016-04-11 20:36
can there be non-validating peers participating in PBFT (somehow .... not sure what that would mean, hence why I ask) that would thus have a full replica of the blockchain?

chetsky
2016-04-11 20:37
if it's useful, I can explain further

jyellick
2016-04-11 20:37
So, 'participating in PBFT', I'd say no, unless you're a validating peer, you shouldn't be in the white list, so we shouldn't process PBFT messages from you. There's no technical reason why state transfer could not also poll non-validating peers for blocks, in the event of state transfer though.

chetsky
2016-04-11 20:38
ok. that's what I figured -- there's two kinds of nodes: "validating peers" and "clients", right?

chetsky
2016-04-11 20:38
if you're not a validating peer, you're not going to get the blockchain, period

jyellick
2016-04-11 20:38
Well, non-validating peers do get a copy of the blockchain, but not the state

jyellick
2016-04-11 20:39
I think I would consider a "client' to be someone interacting with the REST interface

jyellick
2016-04-11 20:39
and the types of peers to be "validating" and "non-validating"

jyellick
2016-04-11 20:39
Where "validating" peers execute transactions and participate in consensus, and "non-validating" peers can be connected to by "clients", and keep a copy of the blockchain, but not state

jyellick
2016-04-11 20:40
(This is of course all 'as it stands today', not necessarily a reflection of how things should be)

chetsky
2016-04-11 20:41
I'm not sure I understand what a non-validating peer is, then

jyellick
2016-04-11 20:41
[I'll also point out, I'm not sure that "non-validating peers" actually _work_ with PBFT, I think the integration is only in noops]

chetsky
2016-04-11 20:41
-ah-

chetsky
2016-04-11 20:41
ok.

chetsky
2016-04-11 20:41
um, can you talk voice?

jyellick
2016-04-11 20:41
Sure

chetsky
2016-04-11 20:41
maybe better than IM

chetsky
2016-04-11 20:42
6178695700 ?

chetsky
2016-04-11 20:42
or ph#

keithsmith
2016-04-11 20:45
has joined #fabric-consensus-dev

chetsky
2016-04-11 20:49
sorry, lost you

chetsky
2016-04-11 20:49
can you call? or if you give me a ph# I can call

jyellick
2016-04-11 20:55
Think I lost you again

chetsky
2016-04-11 20:55
dialing

jyellick
2016-04-11 21:00
Signed-off-by: Your Name <youremail@domain>

chetsky
2016-04-11 21:00
gennady laventman

akakoudakis
2016-04-11 22:00
has joined #fabric-consensus-dev

mcrafols
2016-04-12 07:41
has joined #fabric-consensus-dev

simon
2016-04-12 11:35
@jyellick: i think we need to make state transfer a bit more proactive - I created a test where i drop a set of commit messages to all but one correct replica (i.e. only that replica gets a commit certificate and executes), and then only that replica can keep on executing - everybody else is stuck

jyellick
2016-04-12 13:35
@simon: Could you elaborate a bit on that scenario? The way I read it, is that you have a 4 replica network, somehow vp3 gets a commit certificate and executes, but vp0,1,2 for whatever reason don't, and do not execute. You're saying the network then stalls? When vp0,1,2 have no execution at that sequence number, I would expect for progress to stop, and a view change timer to trigger, which would then get vp0,1,2 to recover. I assume I'm missing something here?

simon
2016-04-12 13:36
let me gist the test

simon
2016-04-12 13:38
vp3 is down (byzantine), vp1 operates normally, vp0 and vp2 operate "normally*"; they don't receive commits from anybody

jyellick
2016-04-12 13:50
Is this for `TestReplicaCrash1` or `TestReplicaCrash2`?

simon
2016-04-12 13:50
replicacrash2

simon
2016-04-12 13:50
oh i should have gisted just the section

jyellick
2016-04-12 13:51
Sorry, was looking at `TestReplicaCrash1`, let me see...

jyellick
2016-04-12 13:53
Line 967 ``` if filterMsg && dst != -1 && dst != 1 && pm.GetCommit() != nil { ```

jyellick
2016-04-12 13:54
That looks to me like it is filtering all commit messages? I suspect one of those dst filters should be different?

jyellick
2016-04-12 13:54
Oh, wait, nevermind

jyellick
2016-04-12 13:57
(Sorry, somehow read `-1` in both checks)

jyellick
2016-04-12 13:57
But I still think this is a view change scenario

jyellick
2016-04-12 13:57
And my suspicion is that this test is exiting before the view change timer can fire

simon
2016-04-12 13:58
i see

jyellick
2016-04-12 14:03
In fact, I don't see how this could be fixed by state transfer, we've only had one execution, and we need at least f+1 same results to trust it.

simon
2016-04-12 14:03
yea

simon
2016-04-12 14:03
so a view change should clear that up

jyellick
2016-04-12 14:04
Right

carmania
2016-04-12 14:41
has joined #fabric-consensus-dev

simon
2016-04-12 16:12
@jyellick: you around?

simon
2016-04-12 16:12
jyellick: the view change half solves it

simon
2016-04-12 16:13
jyellick: because the log rolls over at view change, there are no null requests included in the new-view message, so the view change timer expires

jyellick
2016-04-12 16:19
@simon I think I'm lost again. The view change timer should be reset once the new-view message is processed. I would not expect null requests from that scenario, shouldn't everyone reprocess that commit that was dropped, then the new requests that they already prepared, so there should be no gaps, and no null requests. I'm not seeing why a timer should be expiring.

simon
2016-04-12 16:29
so far we've reset the timer only when a request executes

jyellick
2016-04-12 16:31
We might want to change that, but I would expect for requests to be executing after this view change? vp1 has 1 committed, and 2 outstanding, and the rest have 3 outstanding, so there should be an execution everywhere?

simon
2016-04-12 16:31
If the timer expires before it receives a valid NEW-VIEW message for v + 1 or before it executes a request in the new view that it had not executed previously, it starts the view change for view v + 2 but this time it will wait 2T before starting a view change for view v + 3.

simon
2016-04-12 16:32
so what happens is that vp1's new view timer expires (nothing executed, because it actually executed all previously), and vp0's timer expires because it has a request outstanding that nobody processed yet

jyellick
2016-04-12 16:34
Sounds reasonable, but is that a problem? If your test is looking for executions? Also, why didn't vp0 rebroadcast that request after the view change?

jyellick
2016-04-12 16:34
[I'd agree though, that it's bad if having no outstanding requests causes us to view-change indefinitely]

simon
2016-04-12 16:47
yea, i'm looking at executions

simon
2016-04-12 16:47
we also don't have a rebroadcast

jyellick
2016-04-12 16:50
Look for `resubmitRequests` at the end of accepting a new view

jyellick
2016-04-12 16:51
It's implemented in `pbft-core.go` and invoked in `viewchange.go` in `processNewView2`

simon
2016-04-12 16:51
yea, that only happens for requests that are at the new primary

jyellick
2016-04-12 16:52
Ah, I see

jyellick
2016-04-12 16:52
And this is not addressed by your complaints work?

simon
2016-04-12 16:52
not for pure pbft

jyellick
2016-04-12 16:53
(Maybe a view change should trigger something in that path?)

jyellick
2016-04-12 16:53
But in pure pbft all replicas should receive all requests? So the primary should have that request in its store?

simon
2016-04-12 16:54
well, we changed that :confused:

simon
2016-04-12 16:54
in any case, these are two separate issues

simon
2016-04-12 16:54
even if there is no outstanding request, there are view changes

jyellick
2016-04-12 16:58
Agreed. I'm having trouble parsing that comment: ``` If the timer expires before it receives a valid NEW-VIEW message for v + 1 or before it executes a request in the new view that it had not executed previously, it starts the view change ... ``` It says "or", but how could we execute a request in the new view before receiving a valid NEW-VIEW message?

simon
2016-04-12 17:34
yea, i don't know

simon
2016-04-12 18:22
there is something weird with the mock test net

jyellick
2016-04-12 18:25
I think I might know what it is

simon
2016-04-12 18:25
sorry, i'm struggling with slack

simon
2016-04-12 18:26
`20:15:10.215 [consensus/obcpbft] maybeSendCommit -> DEBU 481 Replica 1 broadcasting commit for view=1/seqNo=3`

jyellick
2016-04-12 18:26
The call to `makePBFTNetwork` creates a convenience slice, `pbftEndpoints` which the tests can refer to so that they're not constantly having to type assert

simon
2016-04-12 18:26
but:

jyellick
2016-04-12 18:27
But the network is still backed by the `endpoints` slice from the base mock implementation

jyellick
2016-04-12 18:27
So you need to replace the reference in both slices
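
(Editor's note: the two-slices point above could be sketched like this, with heavily simplified stand-in types, not the actual mock network code: a restarted replica must be swapped into both the typed convenience slice and the base endpoint slice, or the mock net keeps delivering to the old instance.)

```go
package main

import "fmt"

// Simplified stand-ins for the mock network's two views of the same
// replica set: the base interface slice and the typed convenience slice.
type endpoint interface{ id() int }

type pbftEndpoint struct{ replicaID int }

func (p *pbftEndpoint) id() int { return p.replicaID }

// replaceReplica swaps a restarted replica into BOTH slices; updating
// only the convenience slice would leave the base network delivering
// messages to the old, stopped instance.
func replaceReplica(pbftEndpoints []*pbftEndpoint, endpoints []endpoint, idx int, fresh *pbftEndpoint) {
	pbftEndpoints[idx] = fresh
	endpoints[idx] = fresh
}

func main() {
	old := &pbftEndpoint{replicaID: 0}
	pbftEndpoints := []*pbftEndpoint{old}
	endpoints := []endpoint{old}

	fresh := &pbftEndpoint{replicaID: 0}
	replaceReplica(pbftEndpoints, endpoints, 0, fresh)

	fmt.Println(pbftEndpoints[0] == fresh, endpoints[0] == endpoint(fresh))
}
```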

simon
2016-04-12 18:27
`20:15:10.697 [consensus/obcpbft] recvCommit -> DEBU 4f2 Replica 2 received commit from replica 1 for view=1/seqNo=3`

simon
2016-04-12 18:28
it takes more than 400ms to deliver that message

simon
2016-04-12 18:28
including a time where no activity happens:

simon
2016-04-12 18:29
ah damn

simon
2016-04-12 18:29
wrong test

simon
2016-04-12 18:29
or wrong run

simon
2016-04-12 18:29
i ran it with -count 10

simon
2016-04-12 18:29
let me check again, there was something with a timeout and late message delivery

jyellick
2016-04-12 18:30
Ah, yeah, I really wish go did a better job making sure logs don't interleave, makes the output useless unless you run with `-v`, which is awful if the test passes

simon
2016-04-12 18:33
oh i use -v

simon
2016-04-12 18:33
what happens without -v?

simon
2016-04-12 18:35
maybe i need to increase the timers

simon
2016-04-12 18:35
seems my laptop is weak, and many tests = slower

jyellick
2016-04-12 18:36
Without `-v` there's no indication of where the previous tests ended and where the new one begins

jyellick
2016-04-12 18:36
Do you use `-parallel 1`?

simon
2016-04-12 18:36
but i've seen this before, long time of nothing, timer expires, later messages get delivered

simon
2016-04-12 18:37
no, i thought the default was parallel 1

jyellick
2016-04-12 18:37
So, usually when this happens, it's because that thread is stuck waiting for the pbft lock

jyellick
2016-04-12 18:37
Default is parallel $GOMAXPROCS I think

simon
2016-04-12 18:37
but that would mean that somebody else is holding it...

simon
2016-04-12 18:37
oh wow

simon
2016-04-12 18:37
how does it separate the outputs then?

simon
2016-04-12 18:38
ah, now it passed

jyellick
2016-04-12 18:38
```
vagrant@ubuntu-1404:/opt/gopath/src/github.com/hyperledger/fabric$ go test -help 2>&1 | grep -A2 parallel
	-parallel n
	    Allow parallel execution of test functions that call t.Parallel.
	    The value of this flag is the maximum number of tests to run
	    simultaneously; by default, it is set to the value of GOMAXPROCS.
	    Note that -parallel only applies within a single test binary.
	    The 'go test' command may run tests for different packages in
	    parallel as well, according to the setting of the -p flag
	    (see 'go help build').
```

simon
2016-04-12 18:38
yea

simon
2016-04-12 18:38
silly me, assumptions

simon
2016-04-12 18:39
well, one bug down

simon
2016-04-12 18:39
how's your messaging rework going?

jyellick
2016-04-12 18:40
Trying to fix some of these bad decisions, like having handlers instantiate the plugin, and running into dependency cycles in the imports

simon
2016-04-12 18:40
great, thanks

jyellick
2016-04-12 18:40
Do have the fanin/out stuff written plus accompanying test

simon
2016-04-12 18:40
yea i ran into all of this

jyellick
2016-04-12 18:41
Need to talk with Jeff more, but he was tied up yesterday, and again today with visiting folks

simon
2016-04-12 18:41
so main.go:peer will have a statement that will create a consensus plugin?

simon
2016-04-12 18:41
happy to discuss with you too

jyellick
2016-04-12 18:43
Yes, so I was going to have `peer.go`s `NewPeerWithHandler` function call into `controller.InitializeConsensus` with itself (`peerImpl`) as a parameter, which would instantiate the plugin. Then `handler` would simply refer to the singleton instance in `controller` to route things.

jyellick
2016-04-12 18:44
The problem is, that `peerImpl` isn't actually what's required, it needs to be wrapped by `helper`, which in turn references the interfaces defined in `peer`, and causes this cycle, which I'm trying to break.

jyellick
2016-04-12 18:44
One easy fix, would be to move the interfaces defined in `peer` out into some other package, which both could reference, but which reference neither, breaking the cycle

jyellick
2016-04-12 18:45
But, this would be a moderately large (if trivial) changeset outside of the consensus package, so felt like I should consult with Jeff before stepping on his toes like that.

simon
2016-04-12 18:45
yea, peer is a mixture of multiple things

simon
2016-04-12 18:46
it's the local peer operating, it is singletons for the local peer, and it is instances of connected peers

simon
2016-04-12 18:46
for inspiration, have a look at how tendermint does it - not saying we should do the same

simon
2016-04-12 18:46
they just do it entirely differently

simon
2016-04-12 18:47
but yea, large change, but necessary

simon
2016-04-12 18:48
what about having one module dealing with running a grpc endpoint, and maintaining a list of connected peers, and another module taking incoming messages from that first module, and distributing them to all registered handlers

simon
2016-04-12 18:50
and a third module (or maybe main.go) creating the switchboard, the grpc/network facing side (taking switchboard as consumer), and the handlers (being registered/passed to the switchboard)

simon
2016-04-12 18:50
just an idea
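
(Editor's note: the switchboard idea above could be sketched like this. Everything here is hypothetical, not the fabric peer code: one module would own the gRPC endpoint and peer list, the switchboard fans incoming messages out to registered handlers, and main wires them together.)

```go
package main

import "fmt"

type message struct {
	from, payload string
}

// handler is what consensus plugins, state transfer, etc. would implement.
type handler interface {
	handleMessage(msg message)
}

// switchboard distributes every incoming message to all registered handlers.
type switchboard struct {
	handlers []handler
}

func (s *switchboard) register(h handler) { s.handlers = append(s.handlers, h) }

func (s *switchboard) deliver(msg message) {
	for _, h := range s.handlers {
		h.handleMessage(msg)
	}
}

// recorder is a toy handler standing in for a consensus plugin.
type recorder struct{ seen []string }

func (r *recorder) handleMessage(msg message) { r.seen = append(r.seen, msg.payload) }

func main() {
	sb := &switchboard{}
	consensus := &recorder{}
	stateXfer := &recorder{}
	sb.register(consensus)
	sb.register(stateXfer)

	// The network-facing module would call deliver for each gRPC message.
	sb.deliver(message{from: "vp1", payload: "prepare"})

	fmt.Println(len(consensus.seen), len(stateXfer.seen))
}
```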

simon
2016-04-12 18:50
it's getting late here, so concentration is waning

simon
2016-04-12 18:51
oh yiss, tests are passing

simon
2016-04-12 18:51
so now i can start adding all the persistence stuff

simon
2016-04-12 18:52
we need more small modules - i think pbft would also benefit from this

jyellick
2016-04-12 18:53
Yes, there are a few things I really would like to see changed. Right now, `peer` has this FSM code but consensus filters messages before they reach the FSM, even though those filtering decisions need to be made based on the state of the FSM. Plus, `peer` does too much: it does the under-the-covers state transfer stuff, and the peerEndpoint stuff, which are quite separate ideas. It seems quite conceivable that we would want the peerEndpoint information available and none of the state transfer nonsense.

simon
2016-04-12 18:53
or at least well encapsulated interfaces

jyellick
2016-04-12 18:53
I think each of these little modules, like say peerEndpoint, statetransfer, etc. should each be in their own encapsulated piece of code. Then, depending on the type of deployment, you pick which ones you want to glue together.

simon
2016-04-12 18:54
regarding "what if another consensus wants to do something per new connection": the tendermint code has an API for that as well: `AddPeer`, `RemovePeer`, and messages coming in

simon
2016-04-12 18:54
it's a bit different, because they implement gossip in the consensus

simon
2016-04-12 18:55
and they use a lot of goroutines, which means no thorough unit tests (at least I didn't see them) (because distributed system)

simon
2016-04-12 18:55
but that could conceivably be an event

simon
2016-04-12 18:55
peer attached/detached

jyellick
2016-04-12 18:57
Right, if we were more event driven with that, it would also eliminate the polling pattern that had to be used for the whitelisting

mandler
2016-04-13 08:05
has joined #fabric-consensus-dev

davidcosta
2016-04-14 09:19
has joined #fabric-consensus-dev

simon
2016-04-14 10:38
this executor code is complex

simon
2016-04-14 10:38
do we really need all these threads and queues?

simon
2016-04-14 11:43
i can't deal with this executor; i'm looking at how to remove it again

jyellick
2016-04-14 13:04
There's only one thread in the executor

jyellick
2016-04-14 13:04
And one queue?

simon
2016-04-14 13:08
yea, i just can't deal with it

simon
2016-04-14 13:09
way too complicated

jyellick
2016-04-14 13:09
I wouldn't be overly opposed to ripping it out of Sieve, the `Validate` stuff is pretty ugly, without that bit of code, it would simplify a lot

jyellick
2016-04-14 13:10
But I think the simplifications it brings to pbft-core and classic/batch are worth it

simon
2016-04-14 13:10
what kind of simplification?

simon
2016-04-14 13:11
i just ripped out a hundred lines of code, and it still seems to work

jyellick
2016-04-14 13:14
By making it synchronous?

jyellick
2016-04-14 13:19
Pushing all state modification onto a single thread simplified pbft-core a lot from a state transfer perspective. There was also a lot of code duplication in classic/batch. And completely separating the pbft-core and execution bits have made it more clear (at least to me) where issues are occurring.

simon
2016-04-14 13:20
i agree, the more synchronous, the better

jyellick
2016-04-14 13:21
We'll need to support remote (which means asynchronous) execution in the future?

simon
2016-04-14 13:24
i don't know

simon
2016-04-14 13:24
i don't think it would be asynchronous

jyellick
2016-04-14 13:30
I suppose we could make it more of a synchronous RPC call, that is just not how anything works today, everything is an asynchronous message on the stream.

jyellick
2016-04-14 13:50
Maybe you could tell me specifically what you're interested in ripping out, I've been reviewing that code, and I'm not seeing a ton which is dedicated to keeping things asynchronous.

simon
2016-04-14 14:03
i'm testing what happens if i just get completely rid of the executor

simon
2016-04-14 14:05
i tried to persist the lastexec seqno, and I didn't see any obvious way how to do that with the executor

simon
2016-04-14 14:05
so my choice is: remove the complexity until I can reason about it again, and work with the code again

simon
2016-04-14 14:06
or give up and not work on the code anymore

jyellick
2016-04-14 14:15
Well, the choice seems pretty obvious. My concern with removing the executor code is simply that some of the complexity is due to some nasty race type corner cases, and I want to make sure they aren't re-introduced

simon
2016-04-14 14:16
i agree

simon
2016-04-14 14:17
let's try to make this thing as synchronous as possible

simon
2016-04-14 14:18
if we use a channel to pipe in all messages, we can skip all locks

jyellick
2016-04-14 14:18
+1 on channels over locks
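
(Editor's note: a minimal sketch of the channel-over-locks pattern being agreed on here, hypothetical names, not the actual pbft code: all messages are piped into one goroutine that owns the state, so no mutexes are needed.)

```go
package main

import "fmt"

// event is any message piped into the single consensus goroutine.
type event struct {
	kind  string
	seqNo uint64
}

// core owns all consensus state; because only the loop goroutine
// touches it, no locks are needed.
type core struct {
	events   chan event
	done     chan struct{}
	lastExec uint64
}

func newCore() *core {
	c := &core{events: make(chan event, 16), done: make(chan struct{})}
	go c.loop()
	return c
}

func (c *core) loop() {
	for ev := range c.events {
		switch ev.kind {
		case "execute":
			c.lastExec = ev.seqNo
		}
	}
	close(c.done)
}

// inject is the only entry point for other goroutines (gRPC handlers,
// timers, the executor): they send events, never touch state directly.
func (c *core) inject(ev event) { c.events <- ev }

// stop drains the loop and returns the final lastExec.
func (c *core) stop() uint64 {
	close(c.events)
	<-c.done
	return c.lastExec
}

func main() {
	c := newCore()
	for s := uint64(1); s <= 3; s++ {
		c.inject(event{kind: "execute", seqNo: s})
	}
	fmt.Println(c.stop())
}
```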

simon
2016-04-14 14:18
and we should have a way to send results for RPCs

simon
2016-04-14 14:19
i.e. once a message has been accepted into persistent custody

simon
2016-04-14 14:19
or maybe the transaction submission should only reply with a transaction uuid if it was accepted by the consensus network?

simon
2016-04-14 14:19
i don't know

jyellick
2016-04-14 14:21
I'd like to figure this out now, as it impacts #919

jyellick
2016-04-14 14:21
When a message comes in, we can reply with an error, or not, and those are our only two options.

jyellick
2016-04-14 14:22
(As the code works today)

simon
2016-04-14 14:22
i know

simon
2016-04-14 14:22
that's why I'm talking about it :simple_smile:

simon
2016-04-14 14:22
the peer code isn't in a shape to wait on some event before it will reply to the originator

simon
2016-04-14 14:22
but it should

jyellick
2016-04-14 14:22
Well, I think that is by design

jyellick
2016-04-14 14:23
gRPC supports multiple models, and the stream one was chosen explicitly because of its future scalability, I believe

simon
2016-04-14 14:23
i mean, even if we persist an incoming request to disk - we might go up in flames and the request never gets pushed to consensus

simon
2016-04-14 14:23
so we should wait until the request is prepared

simon
2016-04-14 14:23
technically

simon
2016-04-14 14:24
the transaction should come with a callback we can invoke

simon
2016-04-14 14:24
which then can send a reply to the originator

simon
2016-04-14 14:24
or a token, and somebody else stores the callback state

simon
2016-04-14 14:24
but let's fix that later

simon
2016-04-14 14:24
this is too much to fix in one go

jyellick
2016-04-14 14:25
What about simply sending a unicast message to the submitter's handle?

jyellick
2016-04-14 14:26
(Though in the case of the REST API, this would be local)

jyellick
2016-04-14 14:27
I guess that is why something token or callback based makes more sense. In the case of a non-local gRPC connection, ie coming from an NVP, the notification should go back over gRPC, but in the case of REST, it should all be internal

simon
2016-04-14 14:27
right now REST uses grpc as well

jyellick
2016-04-14 14:28
Jeff and I were looking to rip that out today or very soon

simon
2016-04-14 14:28
which makes plenty of sense, because it is just a special case of the generic NVP REST gateway

jyellick
2016-04-14 14:28
The problem is that it is a performance bottleneck, and necessarily sidesteps authentication

simon
2016-04-14 14:28
did you guys have a look at my performance branch modifications?

jyellick
2016-04-14 14:28
We have not

simon
2016-04-14 14:28
i converted the devops interface into an actual RPC

simon
2016-04-14 14:29
instead of this streaming interface

simon
2016-04-14 14:29
let gRPC handle the parallelism & pipelining

simon
2016-04-14 14:29
I didn't use authentication yet, because nothing seems to use TLS anyways

simon
2016-04-14 14:30
and authentication inside of TLS is meh - no protection against MitM

jyellick
2016-04-14 14:31
I don't disagree with the idea of letting gRPC do the parallelism and pipelining, it makes a lot of sense to me, but I have to believe that the stream interface was a well thought out decision (maybe I am giving us a little too much credit though)

simon
2016-04-14 14:31
stream is fine to push consensus messages around

simon
2016-04-14 14:31
for a devops interface - why?

simon
2016-04-14 14:32
these *are* RPCs

simon
2016-04-14 14:32
submit, wait for result

tuand
2016-04-14 14:33
i believe this is what keith smith and anya are looking at for their sdk work

simon
2016-04-14 14:34
why is the team not active in #?

simon
2016-04-14 14:34
why aren't any of these design deliberations a topic of discussion in #?

tuand
2016-04-14 14:35
customer meeting in RTP this week

simon
2016-04-14 14:35
sure, but clearly people are working/thinking about things

tuand
2016-04-14 14:35
true, although we might need to prod people to participate

simon
2016-04-14 14:36
this is supposed to be an open and distributed project

simon
2016-04-14 14:36
and it absolutely doesn't feel that way

inabatk
2016-04-14 15:07
has joined #fabric-consensus-dev

jyellick
2016-04-14 15:53
@simon: Are you still around?

simon
2016-04-14 15:53
i am

jyellick
2016-04-14 15:54
In `pbft-core.go` there's the `innerBroadcast` call

jyellick
2016-04-14 15:54
First, we `instance.consumer.broadcast`, then we `instance.recvMsgSync` if we are supposed to send to ourselves as well

jyellick
2016-04-14 15:56
Imagine we are running with 3 replicas up (one failed), we broadcast a checkpoint, which another replica receives, and moves its watermarks, and starts sending us requests above our high watermark, because we have not processed the checkpoint ourselves yet

jyellick
2016-04-14 15:57
This is what I am observing in the stress test now, and it seems to break our network

simon
2016-04-14 15:57
do we drop the lock around `broadcast`?

simon
2016-04-14 15:57
i changed this stuff in my persistence code anyways

jyellick
2016-04-14 15:57
No, though I'm not sure why it would matter? Since these are separate processes

simon
2016-04-14 15:58
because if we don't drop the lock, we shouldn't process the new requests or checkpoints until we processed our own

simon
2016-04-14 15:58
i.e. broadcast and processing of our own message should be atomic

jyellick
2016-04-14 15:58
Ah, I see, hmmm

simon
2016-04-14 15:59
maybe others moved forward more quickly?

simon
2016-04-14 15:59
or does that happen for all replicas?

jyellick
2016-04-14 16:01
We definitely do retain the lock, so I am not sure why this is happening

jyellick
2016-04-14 16:02
So, one replica falls behind because it is simply slower than the rest, and we are being flooded with requests. And one replica ends up getting new pre-prepares for sequence numbers that are outside of its watermarks, but the lock should prevent that.

simon
2016-04-14 16:05
yea so what is going on

jyellick
2016-04-14 16:07
So I think this is what's happening:
vp0 - send checkpoint
vp0 - receive vp0 checkpoint
vp1 - send checkpoint
vp1 - receive vp1 checkpoint
vp2 - send checkpoint
vp2 - receive vp0,vp1,vp2 checkpoint
vp2 - move watermarks and send pre-prepare
vp0,1 - receive vp2 preprepare and ignore
This should trigger a view change, which is not great, but it should not lock up the network

simon
2016-04-14 16:08
yea there is an issue that PBFT glosses over

simon
2016-04-14 16:08
it assumes an infinite "incoming messages" store

simon
2016-04-14 16:08
and picks out the ones it can operate on

simon
2016-04-14 16:19
@jyellick: i'm trying to understand how in your code, pbft will trigger a state transfer

simon
2016-04-14 16:20
something with `weakCheckpointSetOutOfRange`

jyellick
2016-04-14 16:20
Yes

simon
2016-04-14 16:21
so that sets `skipInProgress`

jyellick
2016-04-14 16:21
The code tracks the last checkpoint message which was above our watermarks, for each replica.

jyellick
2016-04-14 16:22
Once f+1 replicas agree that there is a checkpoint above our watermarks, then it sets `skipInProgress` so that PBFT knows that it is out of date.

simon
2016-04-14 16:22
right

simon
2016-04-14 16:22
but how does the state transfer start?

jyellick
2016-04-14 16:22
Once PBFT observes a weak checkpoint, so, a valid state, then it tells the executor to `SkipTo` that checkpoint ID

simon
2016-04-14 16:23
ah!

simon
2016-04-14 16:23
but isn't that the same?

jyellick
2016-04-14 16:23
There's a comment in there that we should reprocess in case it is the same, but it is not necessarily the same.

jyellick
2016-04-14 16:24
We could have f+1 checkpoints for the same sequence number with different ids, or we could have 3 checkpoints for different sequence numbers, all of which are above our watermarks.

jyellick
2016-04-14 16:24
Usually, they will be the same, but not always.

simon
2016-04-14 16:24
ah

simon
2016-04-14 17:13
@jyellick: i think we need to set `activeView = true` when we do `skipTo()`

simon
2016-04-14 17:13
do you agree?

jyellick
2016-04-14 17:14
Yes, I think you're right

simon
2016-04-14 17:14
how do all of these things work, if we have so many bugs everywhere? :simple_smile:

jyellick
2016-04-14 17:15
It is kind of surprising...

simon
2016-04-14 17:15
i guess we should also update our view

simon
2016-04-14 17:15
but how

simon
2016-04-14 17:16
well, i can't just set activeView

simon
2016-04-14 17:16
i need to set view to the right view

simon
2016-04-14 17:16
(`TestFallBehind`)

jyellick
2016-04-14 17:16
Yes, this is one of those pieces of the paper that I dislike, it is inconsistent with respect to 'falling behind'

jyellick
2016-04-14 17:18
Per the paper, we are supposed to simply reject all messages with sequence numbers outside of our watermarks, but later it refers to checking checkpoints above the watermarks. I had asked Marko, and he thought the f+1 checkpoints above was a good solution, but it kind of ignores the view thing

simon
2016-04-14 17:20
replica 3 is effectively disconnected (doesn't receive seqno=1 message)

simon
2016-04-14 17:20
then it goes into view change

simon
2016-04-14 17:20
later it observes checkpoints, catches up

simon
2016-04-14 17:20
but it is still in view change

jyellick
2016-04-14 17:21
Right

jyellick
2016-04-14 17:25
It seems like we not only need to track the out of bounds checkpoints, we then also need to get f+1 agreement on the current view? We could then set it active, to whatever that view happens to be.

jyellick
2016-04-14 17:27
We are still never really promised that we have not already missed some requests, we are basically hoping that we moved our window in time, but I don't think we can do better than that without some significant modifications to the protocol. Effectively we need a view change to be certain of it.

simon
2016-04-14 17:29
or we keep messages around

simon
2016-04-14 17:29
yea i don't know

simon
2016-04-14 17:29
will you send me an invite for the discussion with jeff?

christophera
2016-04-14 17:30
has joined #fabric-consensus-dev

jyellick
2016-04-14 17:30
Let me ask Jeff if he has a better idea of when he'll free up. Would send the invite now but I don't have a firm time yet

jyellick
2016-04-14 17:33
[Also, we are somehow deadlocking when the view change timer fires, I see `Replica 2 view change timer expired, waiting for lock with expired count 2` and then the messages stop, also seeing `Replica 1 view change timer expired, waiting for lock with expired count 2` in the same network, which basically locks up the network. This is in classic, not seeing anywhere where someone is obviously blocked waiting for the lock]

jyellick
2016-04-14 17:35
@simon: Jeff says 4pm EST

jyellick
2016-04-14 17:39
(Will send out a notes invite shortly)

guruprasath
2016-04-14 18:33
has joined #fabric-consensus-dev

gengjh
2016-04-15 01:08
has joined #fabric-consensus-dev

vipinb
2016-04-15 01:32
has joined #fabric-consensus-dev

sheehan
2016-04-15 01:37
has joined #fabric-consensus-dev

nicholas
2016-04-15 09:24
has joined #fabric-consensus-dev

jyellick
2016-04-15 14:58
@simon: Are you around?

ghaskins
2016-04-15 17:06
has joined #fabric-consensus-dev

ajlopez
2016-04-15 22:06
has joined #fabric-consensus-dev

nathonline
2016-04-17 18:09
has joined #fabric-consensus-dev

cbf
2016-04-17 20:04
has joined #fabric-consensus-dev

mark.moir
2016-04-17 22:01
has joined #fabric-consensus-dev

simon
2016-04-18 12:26
with my remove-executor changes most things seem to work fine

simon
2016-04-18 12:27
just that the state transfer parts still trigger a view change, and then the new view doesn't match anybody else

simon
2016-04-18 12:27
and `TestReplicaCrash2` is dropping requests, because they're not being broadcast, but that's an old change

simon
2016-04-18 12:27
maybe i should just change the test evaluation

simon
2016-04-18 12:28
currently waiting on the docker image to build to run behave tests

simon
2016-04-19 15:39
statetransfer and pbft are way too intertwined

jyellick
2016-04-19 17:01
How so?

jyellick
2016-04-19 17:04
(I'm assuming you mean in our code, and not as a general protocol deficiency)

jonathan.mohan
2016-04-19 20:37
has joined #fabric-consensus-dev

kelly
2016-04-20 11:22
has joined #fabric-consensus-dev

simon
2016-04-20 14:51
yey, finally at a point where i can run bdd tests again

tuand
2016-04-20 14:51
what did you do ? I'm running into connection failed on login in behave today

simon
2016-04-20 14:52
i mean after all my code changes


simon
2016-04-20 14:55
yes, i know, a lot of commits

simon
2016-04-20 15:44
jyellick, tuand: do you want to discuss the implementation for #919, #973, etc.?

jyellick
2016-04-20 15:45
@simon: Yes, I've been thinking on it, haven't completely figured it out in my head yet

jyellick
2016-04-20 15:45
My big first question would be, do you think that the plugins should share the single PBFT thread?

simon
2016-04-20 15:46
what plugins?

tuand
2016-04-20 15:46
thursday ? i really need to finish #756 getting through the behave tests

simon
2016-04-20 15:46
i think we need to continually discuss design and implementation strategies

jyellick
2016-04-20 15:49
By plugins I mean classic/batch/sieve

simon
2016-04-20 15:51
ah yes

simon
2016-04-20 15:54
ideally it would just be a state machine

simon
2016-04-20 15:54
event driven

simon
2016-04-20 15:54
and some wrapper around it serializes the events coming in

simon
2016-04-20 15:54
e.g. via channel, or lock

simon
2016-04-20 15:55
depending on whether the events should return an error code, or not

simon
2016-04-20 15:55
and then maybe we even use a FSM tool, instead of open coding the state machine as it is right now

simon
2016-04-20 15:55
not talking about per-request state

simon
2016-04-20 15:56
but about "in view change", "waiting for checkpoint", etc.

simon
2016-04-20 15:57
timer expiring then would also be an event being injected

jyellick
2016-04-20 16:00
(have to run, will respond in a bit)

tuand
2016-04-20 16:07
say for a timer expired event, does that need to go to the front of the queue?

simon
2016-04-20 16:11
whatever the system is that implements the event delivery, it would create/deliver the timeout event at the right time

simon
2016-04-20 16:12
i think we have to implement stronger separation of concerns - smaller structures

tuand
2016-04-20 16:45
yes, i'd like to see smaller components as well

tuand
2016-04-20 16:45
different separation than what jason's done with executor ?

simon
2016-04-20 16:50
i think separated code should not have its own threads/etc

simon
2016-04-20 16:50
ideally

simon
2016-04-20 16:51
because that just makes it so complicated to reason about

simon
2016-04-20 16:51
i don't even understand our test framework anymore

tuand
2016-04-20 16:52
behave :stuck_out_tongue_winking_eye:

simon
2016-04-20 16:52
yea that doesn't work well either

simon
2016-04-20 16:53
some docker containers fail in my tests

simon
2016-04-20 16:53
not always

simon
2016-04-20 16:53
just sometimes

tuand
2016-04-20 16:54
not to hijack this conversation too much but i posted a behave issue in # a couple days ago ... we should really have a #testing or #tools channel

tuand
2016-04-20 16:55
coming back to discussion, so basically components talking to each other via some sort of queues ?

simon
2016-04-20 16:55
wouldn't that be #?

simon
2016-04-20 16:55
ideally no async communication as well

simon
2016-04-20 16:55
i prefer as much as possible to be synchronous and encapsulated

simon
2016-04-20 16:56
and we need to come up with a better solution than marshalling/unmarshalling data left and right

simon
2016-04-20 16:56
it makes sense when shipping over network

simon
2016-04-20 16:57
internally it just breaks static type checking

tuand
2016-04-20 16:58
but then haven't we serialized everything ?

simon
2016-04-20 17:01
yes!

simon
2016-04-20 17:01
all serial = no more internal races

simon
2016-04-20 17:02
oh my

tuand
2016-04-20 17:02
i thought we were talking scalability issues as well

simon
2016-04-20 17:02
now my consensus helper is supposed to signal to the consenter that the state transfer has finished

simon
2016-04-20 17:02
but the helper doesn't know about the consenter - only vice versa

simon
2016-04-20 17:04
i give up

simon
2016-04-20 17:04
this code wants to be hacked, not carefully designed

simon
2016-04-20 17:14
now i want to know whether state transfer works

jyellick
2016-04-20 17:32
Back. Maybe you could help me understand why you care if/when state transfer finishes? In general, I tried to be careful with the executor to make sure that state transfer was an atomic non-blocking operation from the orderer perspective

jyellick
2016-04-20 17:42
@simon: With respect to synchronous vs asynchronous, I agree that within some unit of code, call it a module say, it makes sense to keep everything entirely synchronous. The fact that it is possible for multiple handlers to have threads active inside PBFT at the same time today is a real problem, and there's just too much surface area for bugs to get in; we need to fix this. The intent with the fabric code, from a design perspective, is to follow the actor pattern, and it seems like that would be a good solution for PBFT. Have the external methods simply queue messages into channels, then have a single thread dedicated to PBFT which selects across those channels, performs work, and repeats. That should eliminate all the pbft internal race bugs, because everything would now be done on a single thread. That's what I would like to implement for #919 #973

simon
2016-04-20 17:43
then let's make pbft entirely event driven

simon
2016-04-20 17:44
and a shim that allows communication with the rest of the stack

jyellick
2016-04-20 17:44
The piece where I think I diverge slightly from you is whether execution should be synchronous to PBFT as well. It could obviously be done either way, but I'd be in favor of keeping the asynchronous execution model, to make a later split easier.

simon
2016-04-20 17:44
that shim would also convert timeouts to events

simon
2016-04-20 17:44
no, i agree that execution does not need to be synchronous

jyellick
2016-04-20 17:45
So would you propose using something like the FSM package that is used in peer?

simon
2016-04-20 17:45
in an event driven system, the fact that the stack executed a transaction would be another event, i guess

simon
2016-04-20 17:45
that FSM package seems odd to use

simon
2016-04-20 17:45
lots of strings

jyellick
2016-04-20 17:46
Yes, that was my impression as well, and the fact that there are strings makes me think it's being implemented as reflection under the covers, which also seems slow.

simon
2016-04-20 17:46
then pbft could send another execute

simon
2016-04-20 17:47
i.e. execution would be async, but always only one execution outstanding

simon
2016-04-20 17:47
and while we're at it, we should do the same to state transfer

simon
2016-04-20 17:48
i mean the conversion to events

jyellick
2016-04-20 17:53
So help me understand this a little better. The thing I like about the current execution model is that, in general, PBFT can treat executions as atomic and incapable of failing, even though they are not. (Obviously this is not true for Sieve, but that is a different discussion.) If an execution can't be performed, say, because a state transfer is pending, then from PBFT's perspective, it doesn't have to care. The only callback standard PBFT needs is the periodic checkpoint messages. Once you switch to a model where there is only ever one execution outstanding, I suppose it's PBFT who buffers the executions in the case of something like state transfer. From a future split perspective, it seems like a queue of requests would be more scalable; having to wait for a network round trip in between every execution seems problematic.

simon
2016-04-20 17:53
i don't think there will ever be such a split

simon
2016-04-20 17:54
realistically speaking

jyellick
2016-04-20 17:55
I thought this was the whole endorser sort of model? Where there would be consensus as a service which is simply doing ordering, but needs to send the transaction off for execution at other nodes?

simon
2016-04-20 17:55
in that case the execution would be "add this block to the list of blocks"

simon
2016-04-20 17:55
not execution of transactions

simon
2016-04-20 17:56
i see your point though

simon
2016-04-20 17:56
but i'd prefer to have it absolutely stable before making it faster

jyellick
2016-04-20 17:57
Yes, stability certainly takes preference over speed

simon
2016-04-20 17:57
now that i hacked the code, i realize that i did have a separate thread for executions

simon
2016-04-20 17:58
i guess that's what your executor did - just that i absolutely had trouble following it

jyellick
2016-04-20 17:58
Yes, the executor removed that thread and added its own (or maybe more accurately, just moved it to its own file and structure)

jyellick
2016-04-20 17:59
So assuming we go to this event driven FSM type model

jyellick
2016-04-20 17:59
How do you see the plugins working with this?

jyellick
2016-04-20 18:00
Do they run their own separate FSM, or do they somehow attempt to extend the underlying PBFT one?

simon
2016-04-20 18:00
i don't know

simon
2016-04-20 18:00
it's all a mess

simon
2016-04-20 18:01
probably they should run their own fsm

simon
2016-04-20 18:01
or maybe not?

simon
2016-04-20 18:01
they're so tied to it

simon
2016-04-20 18:01
e.g. classic

jyellick
2016-04-20 18:02
The plugin piece is what has been making me scratch my head. I think the core PBFT actually wouldn't be that difficult to clean up. And I think batch and classic really would be trivial to move to their own FSM.

jyellick
2016-04-20 18:02
But Sieve actually cares about PBFT internal state, like view, and timers, which is much more difficult.

simon
2016-04-20 18:02
but what kind of fsm would that be, for classic

jyellick
2016-04-20 18:05
I think part of the problem too is maybe that the `innerStack` interface should be broken down. There's no reason classic and batch should be implementing different execute verify sign etc. methods. So a FSM for classic would be pretty trivial, I'm not even sure it needs more than 1 state. Its thread listens for incoming messages, if one arrives, it delivers, and goes back to listening.

jyellick
2016-04-20 18:06
For batch, you would only have slightly more. You're in that listen state, you wait for either a timer event, or the nth message, then inject. Neither of them care about any of the rest of the innerstack interface.

simon
2016-04-20 18:06
right

simon
2016-04-20 18:07
i don't think the plugin should have its own thread

simon
2016-04-20 18:07
some state machine service should have that thread

simon
2016-04-20 18:07
and only delivers events to the state machine

jyellick
2016-04-20 18:09
So a plugin would register event types (and maybe additional event sources) it's interested in?

simon
2016-04-20 18:09
or it just receives them

jyellick
2016-04-20 18:09
How would you handle if the core and a plugin were both interested in the same sort of event? Plugin supersedes?

simon
2016-04-20 18:09
and optionally ignores them

simon
2016-04-20 18:10
i think an event always goes to a specific fsm

simon
2016-04-20 18:10
in the context of FSMs

simon
2016-04-20 18:10
we're far away from an event bus

jyellick
2016-04-20 18:14
I guess I'm still a little fuzzy on how say, Sieve, would be implemented in this scheme. There's the core PBFT FSM, and it's fairly obvious how classic or batch would be implemented on top of that. For Sieve, it could certainly inject a request into the core state machine, but how does it deal with all its other state?

simon
2016-04-20 18:15
i don't know

simon
2016-04-20 18:15
if sieve and the pbft core (as its sub-fsm) execute within the same fsm context

simon
2016-04-20 18:15
then sieve could still access the pbft internals (not saying it is a good thing)

simon
2016-04-20 18:15
not modify them

simon
2016-04-20 18:15
but inquire

jyellick
2016-04-20 18:17
Just as another thing to keep in mind, speaking with @chetsky he would like a way to publish PBFT state via an external gRPC interface. Essentially to support dashboarding, so that a) developers can more easily debug problems without having to crawl through logs b) operators can easily verify that their systems are functioning properly and making progress.

simon
2016-04-20 18:19
what kind of state would that be?

jyellick
2016-04-20 18:19
(your mentioning being able to inquire without modifying state reminded me of this)

jyellick
2016-04-20 18:21
Well, we would pick what is most useful, but I think certainly things like `lastExec`, our watermarks, if we believe we are out of sync, what view we are in, if it is active.

simon
2016-04-20 18:21
formalized logging

jyellick
2016-04-20 18:21
Ultimately for operators we'll want to boil this down to a green/red type thing, 'this node is functioning properly' or 'this node seems to be in trouble'

jyellick
2016-04-20 18:22
But yes, formalized logging seems like a reasonable interpretation

jyellick
2016-04-20 18:24
Ultimately he would like to see it become a more system wide thing, to be able to inspect the state of the ledger, what chaincodes are deployed, etc.

simon
2016-04-20 18:27
ok, i gotta go soon

jyellick
2016-04-20 18:28
Okay, so getting quickly back to #919 #973, about what can be implemented in the short term. It's certainly possible to convert all of the message reception into a channel, so that we can select across our view change timer, or a message received channel.

jyellick
2016-04-20 18:29
It also wouldn't be difficult to convert the Checkpoint callback to enqueue a message which could also be selected on.

simon
2016-04-20 18:30
but should we do stuff for short term?

simon
2016-04-20 18:30
because that's been sort of the trouble

jyellick
2016-04-20 18:31
Well, I guess with the whole agile model, we should try to accomplish something concrete in this sprint. There are a number of outstanding bugs that are killing our scale tests because we deadlock, and I think this would fix those, and I think it would put us a little closer to a real event driven fsm.

simon
2016-04-20 18:31
yes

simon
2016-04-20 18:32
do we know what deadlocks precisely?

jyellick
2016-04-20 18:32
I can enumerate some of them. One is that because we drop the lock around executions, the executions can actually be invoked out of order

jyellick
2016-04-20 18:33
Which causes the blockchains to diverge and pbft to stop making progress

simon
2016-04-20 18:33
ah what?

jyellick
2016-04-20 18:34
I have seen it in the logs, we'll get something like:
executing/committing seqNo 35
executing/committing seqNo 36
executing/committing seqNo 37
executing/committing seqNo 34
executing/committing seqNo 38

simon
2016-04-20 18:34
how is that possible

jyellick
2016-04-20 18:34
Well, maybe not exactly that line, sorry

simon
2016-04-20 18:34
yea

jyellick
2016-04-20 18:35
Let me get the real log

jyellick
2016-04-20 18:35
If I still have it...

jyellick
2016-04-20 18:36
Ah, damn, blew them away to run the behave tests

jyellick
2016-04-20 18:37
Yes, on line 785, we increment `lastExec`, then we drop the lock, and send the exec off

jyellick
2016-04-20 18:38
As soon as we drop that lock, another handler thread can come in, and do its own thing

jyellick
2016-04-20 18:38
And so if the thread happens to get unscheduled before we actually run the invoke, then things break, I suppose that might be fixed by simply setting the lastExec after we send the execute

jyellick
2016-04-20 18:39
But we drop and re-acquire these locks far too often, and it just invites these sorts of bugs

simon
2016-04-20 18:47
yes

simon
2016-04-20 18:47
absolutely agree

simon
2016-04-20 18:47
i think i made exec synchronous again

simon
2016-04-20 18:47
which breaks all sorts of other things, because that deploy takes forever

jyellick
2016-04-20 18:49
Yes, it seems like deploys are going to need to be restricted in any sort of production system, they are such an easy DOS path


simon
2016-04-20 18:49
or we make deploy execution asynchronous

simon
2016-04-20 18:49
i'd love to get some feedback on this

simon
2016-04-20 18:50
ok, now i'm really out

simon
2016-04-20 18:50
byes

jyellick
2016-04-20 18:59
Was reviewing your code

chetsky
2016-04-20 19:06
we should discuss deploy. it's being done wrong today, b/c there's no in-state row for a deploy

jyellick
2016-04-20 19:08
@chetsky: Did I appropriately summarize your desires for the pbft state introspection interface?

chetsky
2016-04-20 19:09
uh, voice?

chetsky
2016-04-20 19:09
yes, that summary was spot-on

chetsky
2016-04-20 19:09
re: deploy

jyellick
2016-04-20 19:09
I tagged you many messages above here, sorry, slack is not great about that

chetsky
2016-04-20 19:09
I saw it

chetsky
2016-04-20 19:09
slack is at least good that way

chetsky
2016-04-20 19:09
it was fine

chetsky
2016-04-20 19:10
oof, incoming

chetsky
2016-04-20 19:10
will IM back

jyellick
2016-04-20 19:10
Alright

chetsky
2016-04-20 19:10
but deploy is all wrong

tuand
2016-04-20 19:11
@binhn: @muralisr see above re: deploy

binhn
2016-04-20 19:11
has joined #fabric-consensus-dev

muralisr
2016-04-20 19:11
has joined #fabric-consensus-dev

sheehan
2016-04-20 19:12
@chetsky: are you talking about https://github.com/hyperledger/fabric/issues/1054 or something else?

chetsky
2016-04-20 19:42
right, but that doesn't go far enough

chetsky
2016-04-20 19:42
deployed chaincode needs to have a "state".

chetsky
2016-04-20 19:43
so for instance: "maintenance", "active"

chetsky
2016-04-20 19:43
but also, since apparently building the chaincode is taking nontrivial time, "installed"

chetsky
2016-04-20 19:43
erm, "committed"

chetsky
2016-04-20 19:43
committed == committed to state (sysibm.sysprocedures)

chetsky
2016-04-20 19:44
which causes a background thread to attempt to build the chaincode image

chetsky
2016-04-20 19:45
eventually, a tran would be run at ALL endorsers with some HIGH threshold for success, that would move the chaincode from committed to maintenance

chetsky
2016-04-20 19:45
in maintenance, only users with maintainer rights could invoke trans on it.

chetsky
2016-04-20 19:45
e.g. to call"init"

chetsky
2016-04-20 19:45
and then finaly, move it to "active" so all users with rights can access it

chetsky
2016-04-20 19:46
at a minimum, EVEN IF chaincode image creation is instant, we need maintenance and active

chetsky
2016-04-20 19:46
b/c a single init() call might not be enough to get the chaincode ready for use

chetsky
2016-04-20 19:47
remember that an init() call is a tran like any other, so will be limited in how many rows it can modify

chetsky
2016-04-20 19:47
this is the equiv of: when I deploy a stored-proc, I need to create the tables it accesses, and load initial data into some of them

chetsky
2016-04-20 19:47
that isn't guaranteed to fit in a 5M shell script

muralisr
2016-04-20 19:50
@chetsky: the “deploy state” caught the eye…. don't want to derail the rest of the discussion but the system chaincode (lifecycle, deploy, …we can decide what to call later) will *naturally* provide the way to create a “state” for deployed chaincodes.

muralisr
2016-04-20 19:51
does that fall in line with your thinking … again, didn't want to derail the discussion….

chetsky
2016-04-20 19:51
the table sysibm.chaincodes will have schema

chetsky
2016-04-20 19:51
ccid <long-integer>, body blob, state {MAINT,ACTIVE}

chetsky
2016-04-20 19:51
primary key ccid

chetsky
2016-04-20 19:51
perhaps other columns, like a list of certs who are allowed to invoke maintenance operations on it

muralisr
2016-04-20 19:52
yep

muralisr
2016-04-20 19:53
and when the (deploy) system chaincode receives a tran to deploy the chaincode, it’ll manip. sysibm.chaincodes

chetsky
2016-04-20 19:53
right

chetsky
2016-04-20 19:53
the chaincode -build- process will be -driven- by the state of sysibm.chaincodes and differences between that, and what's on-disk

muralisr
2016-04-20 19:54
right

muralisr
2016-04-20 19:57
basically... the system chaincode acts as a filter for deploy (and other trans) and being itself a chaincode (as opposed to code embedded in fabric as it is today) can access state naturally. Opens up avenues

muralisr
2016-04-20 19:58
@chetsky: did that capture it ? anything else you’d add/change ?

jyellick
2016-04-20 20:06
@simon: https://github.com/jyellick/fabric/pull/1/ You can see my comments on your branch here (sorry for the clunkiness of doing it in a PR by me against my own fork, I did not want to submit your code to the hyperledger fabric project as a PR to discuss if you didn't want it there yet), most of my concerns are around state transfer with the executor removed. I don't think they are beyond addressing, the pre #833 code handled some of these cases, but we are getting right back into having pbft and state transfer far too intertwined in each other's workings, which was one of the principal reasons for pushing it into its own module.

jyellick
2016-04-20 20:13
I've actually been thinking that my biggest mistake with the executor split was attempting to create one executor to serve both sieve and classic/batch, when their execution models are so different. Sieve must only ever have one request in the queue at a time, so trying to use an asynchronous queue-based system doesn't make a lot of sense there. Similarly, Sieve can provide a checkpoint to transfer to at every round of consensus, so the feeding of checkpoints from PBFT is largely superfluous. On the other hand, the executor removed a lot of the complexity of state transfer out of `pbft-core.go`. I think it would be possible to rip the sieve support components out of the executor, which would drastically simplify it, then have Sieve perform its own executions or implement a much simpler, more synchronous executor for Sieve.

chetsky
2016-04-20 22:12
@jyellick might as well just rip out Sieve. It's a broken protocol anyway. No replayability, what's the point of its existence.

chetsky
2016-04-20 22:13
effort spent maintaining it is wasted effort

ghaskins
2016-04-20 22:34
@chetsky: I can't comment on sieve's design or status, but I can say that consensus algorithms that perform EV are mandatory (to me) and I don't think PBFT provides that. At least, not in this context. What are your thoughts on that?

ghaskins
2016-04-20 22:34
The notion that sieve can be arbitrarily abandoned concerns me.

ghaskins
2016-04-20 22:36
What is the replay problem you mention by the way?


ghaskins
2016-04-20 22:37
looks

chetsky
2016-04-20 22:37
also, there's a note -in- this note, with subject "trading latency ..."


chetsky
2016-04-20 22:37
same general subject

chetsky
2016-04-20 22:38
the reason state-machine replication is needed is that we execute chaincode 'everywhere'

chetsky
2016-04-20 22:38
that isn't scalable to begin with

chetsky
2016-04-20 22:38
once you ditch that assumption, you can make HL be a normal database with a cryptographically protected log

chetsky
2016-04-20 22:39
that way lies throughput

chetsky
2016-04-20 22:42
Re: "replay", sieve doesn't guarantee that when an auditor replays a tran, it will have the same effect as when it was committed. So it's entirely possible to be unable to audit a log by replaying it (and since logs don't contain state-deltas, there's no other way to audit a log)

chetsky
2016-04-20 22:48
by contrast, "MVCC+postimage" "locks down" one of the nondeterministic executions of a tran. So even if, at replay time, the tran does something different, the auditor can just apply the state-delta and move past that tran (unless they're actually interested in the tran, in which case they could investigate the discrepancy further, knowing 100% that it's just a non-replayability issue, not some other bug).
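
A toy sketch of the "MVCC+postimage" idea being described: record the versions a tran read and the state delta it produced, so replicas apply the recorded delta rather than re-executing. The version-per-key scheme and all names here are illustrative assumptions, not the actual fabric data model:

```go
package main

import "fmt"

// Versioned is a value in the world state plus its version counter.
type Versioned struct {
	Version int
	Value   string
}

// Tran carries the MVCC read-set (key -> version observed when the
// tran was simulated) and the postimage (key -> new value), i.e. the
// state delta that gets "locked down" at commit time.
type Tran struct {
	ReadSet   map[string]int
	Postimage map[string]string
}

// Apply commits the tran iff every key it read is still at the
// version it observed; otherwise it aborts. It never re-executes
// chaincode, it just applies the recorded delta, so every replica
// reaches the same decision and the same state.
func Apply(state map[string]Versioned, t Tran) bool {
	for k, v := range t.ReadSet {
		if state[k].Version != v {
			return false // stale read: deterministic abort
		}
	}
	for k, val := range t.Postimage {
		state[k] = Versioned{Version: state[k].Version + 1, Value: val}
	}
	return true
}

func main() {
	state := map[string]Versioned{"a": {1, "old"}}
	t := Tran{ReadSet: map[string]int{"a": 1}, Postimage: map[string]string{"a": "new"}}
	fmt.Println(Apply(state, t), state["a"].Value) // true new
}
```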

ghaskins
2016-04-20 23:12
So you are worried about a scenario where a given transaction was deterministic enough that 2f+1 validators computed the same hash...but later an auditor cannot reproduce the same result?

chetsky
2016-04-20 23:15
well, if we admit nondeterministic trans, that's always a possibility.

chetsky
2016-04-20 23:15
whereas, with MVCC+postimage, we -know- what the state-delta is.

chetsky
2016-04-20 23:16
sure, the replay of the tran might produce something different. But we still know what was applied at commit, and we can apply it again

chetsky
2016-04-20 23:16
I gotta run (sister just had twins! time to visit hospital!) but we can talk another time.

ghaskins
2016-04-20 23:16
Ok, congrats!

ghaskins
2016-04-20 23:17
I totally understand the argument of MVCC as a vehicle for concurrency, but it seems some of the other issues are being conflated

ghaskins
2016-04-20 23:17
I'd like to understand more, so ping when you are back

jyellick
2016-04-21 00:25
@ghaskins: Just for clarification, in Sieve, it only takes f+1 votes of confidence in the same answer to be committed

jyellick
2016-04-21 00:30
@chetsky: With respect to ripping out Sieve, I certainly agree that Sieve makes no sense in a post MVCC world, and I'm all in favor of not maintaining code that does not have a long term home. Of course today, Sieve is the only option we have to tolerate chaincode which does not behave in a deterministic way, and I know @simon especially and others (including myself) have put a lot of time into making Sieve functional. I'd like to assume there is some sort of 'must have for this release' requirement that we have been trying to satisfy?

ghaskins
2016-04-21 00:30
@jyellick: i think that makes sense, or at least it doesnt jump out at me as wrong

ghaskins
2016-04-21 00:31
most of these designs fall somewhere in the range of f+1 to 2f+1 to be correct :wink:

ghaskins
2016-04-21 00:31
its not always clear when you can cheat back from 2f+1, but I know there are many opportunities to do so

jyellick
2016-04-21 00:32
Haha, very true, not a huge deal, unless you are truly interested in understanding the inner workings of Sieve. The high level "All the replicas will end up with the same state, even if transactions are behaving nondeterministically, and a byzantine replica will not generally be able to pick that state." is usually good enough.

ghaskins
2016-04-21 00:46
anyway, I dont have a strong opinion on what the consensus algorithm is as long as it performs at scale and verifies outputs

ghaskins
2016-04-21 00:47
if the mvcc work ends up being that and sieve gets deprecated, that is fine by me

ghaskins
2016-04-21 00:47
i would just hate to see pbft become the only option, as I dont think it will work for me

jyellick
2016-04-21 01:00
@ghaskins: To me, the crux of the problem, is that you can take the output of 10 million trusted nodes all running the same code, and get the same result, and that's not good enough to say that it's actually deterministic. It's very possible to prove non-determinism from transaction output, but never to prove determinism. So, you always run the risk that the auditor comes by and replays that transaction and says "Hey, why is my result different?". Now, you can certainly say, that you had a quorum which must have agreed on whatever the output was, but unless you analyze the input itself, you can't make that determinism guarantee. Which is why the MVCC promise is not that the underlying chaincode is deterministic, but the application of the transaction is.

ghaskins
2016-04-21 01:01
i understand that part: to play back..

ghaskins
2016-04-21 01:02
we agree that only deterministic results are valid, and consensus tries to ensure that at least enough nodes agree that the results were deterministic (among other criteria)

ghaskins
2016-04-21 01:02
but there is never a guarantee that operations truly are deterministic…we could have 2f+1 that say it was, but the auditor is the unlucky one that finds out it isn't

ghaskins
2016-04-21 01:03
so what we want to do is prevent this from somehow impacting replay/auditability

ghaskins
2016-04-21 01:03
or minimize it at least

ghaskins
2016-04-21 01:04
it seems that the argument is that mvcc is more resistant to negative impact of non-determinism because the state delta may still be applied even if the process that generated the state delta cannot be reliably reproduced?

jyellick
2016-04-21 01:06
Right, so, MVCC has the advantage that it knows the correct output when trying to come to consensus. Sieve has the much harder task of trying to determine the correct output from among a group of byzantine peers, then come to consensus.

ghaskins
2016-04-21 01:07
ok, i am not sure I fully understand the details on your last statement, but lets back burner that for a second

ghaskins
2016-04-21 01:08
the problem I am having (and this is not sieve/mvcc specific, but just general) is that I dont see the notion of “state” vs “state hash” as being a property that is exclusive to the consensus algorithm

jyellick
2016-04-21 01:09
In the MVCC model, the 'state hash' no longer needs to be in the block, because replaying the transactions is guaranteed to produce the same result.

ghaskins
2016-04-21 01:09
yes, there are differences “on the wire” and even “on the block”, but ultimately the notion of “state at a point in time” or “state deltas” seems to be data that either explicitly exists or could be synthesized at any point in time, based on the fact that a “blockchain” is essentially a record of exactly that

ghaskins
2016-04-21 01:10
so I am not seeing why mvcc buys this replay resistance over other models (at least ones that do EV)

ghaskins
2016-04-21 01:11
right, but “on the block” optimizations are different from the data existing or at least being capable of synthesis

ghaskins
2016-04-21 01:11
i agree on the block models differ, but the state should ultimately be available either way (it has to be)

jyellick
2016-04-21 01:12
The hope is that you could synthesize a state delta from the transactions at any point in time, but with non-determinism that's not the case. Today, we discard the state deltas because of space considerations. So, you can know what the state hash was at a block, but unless you can replay those transactions successfully, have a copy of that state, or have retained the state deltas, it's possible in the worst case that you can never recreate that state.

jyellick
2016-04-21 01:13
In MVCC the transaction is essentially a state delta, it is the keys, and their version, and what they are updated to.

ghaskins
2016-04-21 01:14
ok, lets walk a scenario though: say I have some state with 100 mutations

ghaskins
2016-04-21 01:14
start at mutation 0 with empty rows and grow over time to 100 versions

ghaskins
2016-04-21 01:15
and those 100 versions all had at least 2f+1 commit certificates, etc, were considered value

ghaskins
2016-04-21 01:15
valid

ghaskins
2016-04-21 01:16
then someone tries to audit the sequence, and when they get to block 50 there is some non deterministic function that cant reproduce the same result

ghaskins
2016-04-21 01:16
in one model, we stored 100 state hashes “in the block"

ghaskins
2016-04-21 01:16
in the other, we recorded 100 mvcc+postimage objects

ghaskins
2016-04-21 01:18
if I understand the argument, its that we are saying even though we cant reproduce the result at block 50, we can still apply block 50 because the mvcc data is sufficient to get to block 51

ghaskins
2016-04-21 01:18
is this right so far?

jyellick
2016-04-21 01:18
Yes, so far so good

ghaskins
2016-04-21 01:19
ok, so neither model is sufficient at eliminating the non-determinism (nothing is), but we can at least move on to block 51 with mvcc rather than stall the whole process

jyellick
2016-04-21 01:20
Yep, exactly.
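
The "apply block 50 anyway, move on to 51" behavior just agreed on might be sketched like this (hypothetical types; `Replay` stands in for re-executing the chaincode, `Recorded` for the committed postimage):

```go
package main

import "fmt"

// Entry is one committed tran: the postimage that 2f+1 validators
// agreed on, plus a replay function standing in for re-execution.
type Entry struct {
	Recorded string
	Replay   func() string
}

// Audit replays each entry. On a mismatch it does not stall: it
// applies the recorded postimage and flags the index for further
// investigation, so the auditor can always reach the end of the chain.
func Audit(chain []Entry) (state []string, flagged []int) {
	for i, e := range chain {
		if e.Replay() != e.Recorded {
			flagged = append(flagged, i) // nondeterministic replay
		}
		state = append(state, e.Recorded) // the committed value always wins
	}
	return state, flagged
}

func main() {
	chain := []Entry{
		{Recorded: "x=1", Replay: func() string { return "x=1" }},
		// the "returns 0 until 2017, then 1" case: replay now disagrees
		{Recorded: "y=0", Replay: func() string { return "y=1" }},
	}
	state, flagged := Audit(chain)
	fmt.Println(state, flagged) // [x=1 y=0] [1]
}
```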

ghaskins
2016-04-21 01:21
so what I dont see is why mvcc is providing an advantage: in theory, any of the other nodes should be able to provide/synthesize the necessary state if we ask them in either model

ghaskins
2016-04-21 01:22
i should be able to ask one of the other nodes for block 50, or ask for the delta between 49 and 50, or whatever, regardless of model

ghaskins
2016-04-21 01:22
right?

ghaskins
2016-04-21 01:22
or am I missing something

ghaskins
2016-04-21 01:23
just because the on block representation is just a hash doesnt mean the state no longer exists (conceptually, anyway)

jyellick
2016-04-21 01:23
Ah, so, if the transaction is truly non-deterministic, a node executing it again doesn't guarantee it will get the same result

ghaskins
2016-04-21 01:23
if we elided the state for space optimization or something, that is one thing

jyellick
2016-04-21 01:23
Consider a really silly chaincode which returns 0 until 2017, then returns 1 ever after.

ghaskins
2016-04-21 01:23
ok

jyellick
2016-04-21 01:24
If the transaction executes in 2016, everyone will agree the output is 0, and then in 2017 the auditor comes along and tries to get back to that state.

ghaskins
2016-04-21 01:24
got it, but that is true regardless of model

jyellick
2016-04-21 01:24
Well, in MVCC, the auditor would see, that when this was executed, 2f+1 nodes agreed that the output was 0

ghaskins
2016-04-21 01:25
i think they would see that either way, no?

jyellick
2016-04-21 01:25
And so although the auditor may not be happy about that, that's much better than the alternative that "2f+1 nodes agreed on that output"

jyellick
2016-04-21 01:25
So, in the non-MVCC case, there's no way to recover that the output was 0

jyellick
2016-04-21 01:26
You only know if you got the same output, not what the old output was

ghaskins
2016-04-21 01:26
block 50 would have a state-hash with 2f+1 signatures, and if you ask any of those nodes for the state corresponding to the hash, it should have 0 amongst its rows

jyellick
2016-04-21 01:26
(in the case of where they're different)

jyellick
2016-04-21 01:26
Ah, but nodes do not retain state for every block

jyellick
2016-04-21 01:26
So they say "I garbage collected that ages ago, I don't know what that state was, just what its hash was"

ghaskins
2016-04-21 01:26
it sounds like this is more of a case of the ledger introspection features than a property of the way states are represented on the block

jyellick
2016-04-21 01:27
If we stored the state deltas indefinitely, then you could get replayability

ghaskins
2016-04-21 01:27
you could certainly argue that the internal representation is based on state deltas that are retrievable via state hash

ghaskins
2016-04-21 01:27
right

ghaskins
2016-04-21 01:27
and that is kind of what mvcc is doing, just at a higher level

jyellick
2016-04-21 01:28
No, that's the whole idea of a hash, it's one way, you can't reverse engineer a state from its hash

jyellick
2016-04-21 01:28
Once a state delta is garbage collected, it is gone forever, unless you can reproduce it by replaying the transactions

jyellick
2016-04-21 01:28
(Which is true, so long as your replay is determinstic)

ghaskins
2016-04-21 01:29
understood that you cant from the hash itself, but the on-block representation of a state-hash is there for the purposes of protocol efficiency, not data obscurity…the assumption would be that you should be able to query a hash back to a state value

jyellick
2016-04-21 01:30
There is no facility to do that today, and I don't think it's planned. You can ask for a state delta by block number, but by default I believe we only retain the last 500 of them.

ghaskins
2016-04-21 01:30
ok, thats fine… i didnt mean to comment on impl status, just conceptual understanding

jyellick
2016-04-21 01:30
Certainly retaining a whole copy of the state per block is probably infeasible. Retaining the deltas indefinitely is in a sense what MVCC is doing.

jyellick
2016-04-21 01:31
(But MVCC also brings some other benefits, like scale)

ghaskins
2016-04-21 01:31
my main point is, i dont think MVCC buys us this resistance, its the data model that buys it

ghaskins
2016-04-21 01:31
right, MVCC is useful in other capacities, like concurrency and thus scale

ghaskins
2016-04-21 01:31
you can have a state-delta strategy that is independent of the MVCC debate, thats all I am saying

jyellick
2016-04-21 01:32
Ah, yes, sorry for being dense, that's certainly true.

ghaskins
2016-04-21 01:32
I am not arguing that we should do something other than MVCC either, dont get me wrong

jyellick
2016-04-21 01:32
Well, one thing though

jyellick
2016-04-21 01:32
State deltas are per block.

jyellick
2016-04-21 01:33
What if I have 2 transactions in my block, and I replay them, and I get a different result.

jyellick
2016-04-21 01:33
Then I can see what the aggregate effect of the 2 transactions was, but, I can't actually determine which is behaving non-deterministically.

jyellick
2016-04-21 01:34
(Assuming they both modify the same values)

ghaskins
2016-04-21 01:34
well to be clear, i dont think you want to compress transactions to the block level

ghaskins
2016-04-21 01:34
i envision the finest granularity for the hash is the transaction level, a block would just encompass N transactions (and thus N hashes)

nycnewman
2016-04-21 01:34
has joined #fabric-consensus-dev

jyellick
2016-04-21 01:34
Yes, you could implement it this way, but that is not the current block implementation.

ghaskins
2016-04-21 01:34
sure, thats fair

ghaskins
2016-04-21 01:35
again, just conceptual

ghaskins
2016-04-21 01:35
i actually havent had a chance to study what sieve is doing in reality

jyellick
2016-04-21 01:35
There's a paper around here somewhere I could probably dig up if you are interested, or I could try to quickly walk you through it at some point if you'd like.

ghaskins
2016-04-21 01:36
but ultimately, i think even the on-wire/on-block stuff should never go coarser than transaction-level hash

ghaskins
2016-04-21 01:36
i assume you are talking about the sieve paper? if so, link appreciated

ghaskins
2016-04-21 01:36
i saw it go by while I was traveling a few weeks ago but I have lost track of the link

jyellick
2016-04-21 01:38
I believe this is the paper http://arxiv.org/abs/1603.07351

jyellick
2016-04-21 01:38
@simon Would be a better person to ask

ghaskins
2016-04-21 01:38
awesome, thank you

jyellick
2016-04-21 01:39
No problem

ghaskins
2016-04-21 01:40
going to go eat dinner, thanks for the chat

ghaskins
2016-04-21 01:40
will talk more

jyellick
2016-04-21 01:40
No problem, always happy to help, just let me know

chetsky
2016-04-21 02:56
MVCC does not offer "replay resistance" in the way that Bitcoin does. But there's a minor extension to the design that -does- offer such

chetsky
2016-04-21 02:58
"just because the on block representation is just a hash doesnt mean the state no longer exists (conceptually, anyway)" ... Actually, YES

chetsky
2016-04-21 02:58
you would NOT want to have a database that kept full per-tran state-snapshots around. That would be .... immense

chetsky
2016-04-21 02:58
uncontainably immense

chetsky
2016-04-21 02:59
something that's not clear is: MVCC+postimage is -just- the way normal databases do it

chetsky
2016-04-21 03:00
"you can have a state-delta strategy that is independent of the MVCC debate" .... if you keep only postimages (== state deltas) you cannot provide a reasonable concurrency-control model.

chetsky
2016-04-21 03:00
you might wish to look at Alonso & Kemme, on the Dragon system

chetsky
2016-04-21 03:00
Christian Cachin pointed out that what I proposed is effectively the same as what they did in that work

chetsky
2016-04-21 03:07
The MVCC information corresponds to the locks that the tran would take, and the postimage data corresponds to the REDO information in a typical transaction log.

chetsky
2016-04-21 03:07
we're merely doing what databases do.

ghaskins
2016-04-21 03:10
@chetsky: i think we are saying the same thing

ghaskins
2016-04-21 03:10
when I said state exists, i meant logically

ghaskins
2016-04-21 03:11
as in, “state” can be a synthesis of a composition of deltas

ghaskins
2016-04-21 03:12
and that is exactly my point: the structure that mvcc uses in terms of state deltas isnt really any different than what another design would be doing as well

chetsky
2016-04-21 03:19
this is probably easier to discuss interactively by voice than in a chat . But actually, no, MVCC+postimage provides "proper concurrency control", whereas state deltas alone do not.

chetsky
2016-04-21 03:19
this is where the Dragon paper might be useful

chetsky
2016-04-21 03:20
I will not claim that I was inspired by that work, but it's fair to say that they're prior, and their design considerations were similar to what drove me in this MVCC+postimage design

chetsky
2016-04-21 03:20
MVCC is the locks that would be acquired in a standard-issue lock-manager-based DB

chetsky
2016-04-21 03:20
postimage is -just- what would go in the tran-log

chetsky
2016-04-21 03:21
and the combination of MVCC+postimage allows us to eschew the lock-manager. I can explain sometime more fully by voice

chetsky
2016-04-21 03:22
I'll put it this way: not that you'd want to do this, but you could imagine an application that does queries on a peer, and paints a screen; the application would ask the peer to take a (non-durable) snapshot, against which the queries were run. Say, a trading screen.

chetsky
2016-04-21 03:23
later, when the trader presses "buy", the tran would be "simulated" against the actual state-snapshot that was used to paint the screen; then, it would be re-simulated against the current state of the peer, and if the MVCC+postimage info differed, the trader would get a "stuff changed while you were getting coffee; here's a repaint, do you still want to buy?" message

chetsky
2016-04-21 03:23
this is simply not possible without version-numbers and MVCC information
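
A minimal sketch of the snapshot-then-recheck flow just described, assuming a version number per key (all names here are hypothetical, not peer APIs):

```go
package main

import "fmt"

// Snapshot copies the key versions as of when the screen was painted.
func Snapshot(state map[string]int) map[string]int {
	snap := make(map[string]int, len(state))
	for k, v := range state {
		snap[k] = v
	}
	return snap
}

// Changed reports whether any key the screen depended on has moved
// since the snapshot: the "stuff changed while you were getting
// coffee; here's a repaint, do you still want to buy?" check.
func Changed(snap, current map[string]int, reads []string) bool {
	for _, k := range reads {
		if snap[k] != current[k] {
			return true
		}
	}
	return false
}

func main() {
	state := map[string]int{"price": 3}
	snap := Snapshot(state)
	state["price"]++ // another tran commits meanwhile
	fmt.Println(Changed(snap, state, []string{"price"})) // true
}
```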

chetsky
2016-04-21 03:24
similarly, one can be validating trans in parallel and know that if they're valid, they can be sequenced in any order, and as long as all replicas apply those trans in that same order, each tran will either execute as it did when it was validated, OR it will abort.

chetsky
2016-04-21 03:24
these properties are analogous to the properties you'd get in a normal database

chetsky
2016-04-21 03:24
and NOT what you'd get in ethereum

chetsky
2016-04-21 03:25
(for instance)

simon
2016-04-21 09:00
jyellick: i finally understood what your executor is about

ghaskins
2016-04-21 10:32
@chetsky: I hear you, and understand. My only point was that you don’t need the entire MVCC+postimage design as outlined in your notes to achieve replay-ability of the transaction log…I can’t speak for sieve specifically, but most state-hash-based designs that I can think of only use the state hash as a representational convenience (for instance, cheap equivalency tests and on-the-wire compression). The underlying persistence layers would still be some form of a log of deltas and thus offer the same basic ability to replay the log in auditing scenarios. I wanted to avoid conflating the issues with the solution, thats all.

ghaskins
2016-04-21 10:34
To be clear, I am not saying I am against your design, just pointing out that I think that particular problem can be solved in many ways.

simon
2016-04-21 11:13
we currently have deltas, but they are locally generated, like the blocks themselves

simon
2016-04-21 11:13
and the deltas are not cryptographically linked in the blocks

simon
2016-04-21 11:25
i would prefer to assemble the blocks before sending them through consensus

ghaskins
2016-04-21 11:40
you can, though the problem that I have seen with that approach is that a protocol restart due to non-determinism can be expensive

ghaskins
2016-04-21 11:40
like, how EVE does a viewchange+serial execution when that happens

ghaskins
2016-04-21 11:41
my preference is to have the protocol converge on block inventory rather than use a block as a core negotiating state

ghaskins
2016-04-21 11:41
ala ripple

ghaskins
2016-04-21 11:42
not sure how sieve does it, but I envision that it is closer to EVE than ripple

simon
2016-04-21 12:02
could you explain more?

ghaskins
2016-04-21 12:28
what I mean generally is: start with the notion that transactions flow in to a given peer, some of those will be “valid" and some will not

ghaskins
2016-04-21 12:29
what constitutes valid is of course a complicated criteria

ghaskins
2016-04-21 12:29
some validity checks can be done locally, others are related to whether there is agreement

ghaskins
2016-04-21 12:30
but generally speaking, the node can immediately discard the ones it doesnt believe are valid, and keep the ones that it does think are valid in a “proposed” state

ghaskins
2016-04-21 12:30
and then consensus can whittle that set of proposed transactions down

ghaskins
2016-04-21 12:30
so the set in a block is converged upon, rather than the block being negotiated up front

ghaskins
2016-04-21 12:31
so, maybe 20 come in, 5 are discarded right away, and eventually 10 are agreed to go in the next block, and 5 are deferred to the next round

ghaskins
2016-04-21 12:33
as opposed to taking 20 in the assumed block and then restarting the block process when there is a discrepancy (which is how EVE works, generally)

simon
2016-04-21 12:39
yes

simon
2016-04-21 12:39
EVE is just about optimistic concurrency

simon
2016-04-21 12:39
bitcoin does the same - collect transactions into blocks, consensus accepts blocks

ghaskins
2016-04-21 12:40
right, though I dont think the optimism is bad…what is bad is that the penalty for non-determinism is relatively high (viewchange+serial)

ghaskins
2016-04-21 12:41
what I prefer is a design that assumes some non-determinism might occur and make it less exceptional to handle

ghaskins
2016-04-21 12:42
the trade off is that the overall protocol might have more overhead than the pure optimism approach, but the benefit is that non-determinism is just handled naturally as part of that rather than causing a hiccup

ghaskins
2016-04-21 12:42
so you eliminate that hiccup as a DoS attack vector

ghaskins
2016-04-21 12:42
or mitigate it is a better term

ghaskins
2016-04-21 12:43
non-deterministic results just get deferred/dropped (according to TTL rules) rather than cause a slower mode of operation

simon
2016-04-21 13:00
ripple is pessimistic by default

simon
2016-04-21 13:00
i don't think that ripple would outperform

simon
2016-04-21 13:01
with mvcc, the burden of proof is on the submitter

simon
2016-04-21 13:01
and only if the submitter can collect enough signatures from endorsers, the transaction would even make it into a block

simon
2016-04-21 13:02
and at that point, the postimage is already defined, and everybody now only applies the state delta, instead of executing the transaction again

ghaskins
2016-04-21 13:09
I dont have any objection to the proposal per se, though I was a little concerned about the sequence

ghaskins
2016-04-21 13:10
I was merely pointing out you don’t need the proposal to provide log replay capability

simon
2016-04-21 13:11
no, you just need to keep the deltas

ghaskins
2016-04-21 13:13
right

ghaskins
2016-04-21 13:14
actually, the requirement is that any given state is retrievable, but practical matters probably dictate that it is stored as deltas

simon
2016-04-21 13:14
why does any given state have to be retrievable?

ghaskins
2016-04-21 13:14
logically

ghaskins
2016-04-21 13:15
if you cant reproduce the state at position X, you at least need to be able to retrieve the state at position X

ghaskins
2016-04-21 13:16
thats all we are talking about here

simon
2016-04-21 13:17
i don't understand

simon
2016-04-21 13:17
you're saying that an auditor needs to be able to check the correctness of the blockchain

ghaskins
2016-04-21 13:20
what I am saying is that normally a PBFT-style blockchain is valid if each entry in the chain has a valid commit certificate with the requisite 2f+1 sigs, and you don’t need to be able to re-run past transactions…but if you are something like an auditor and want to re-run each transaction but are unable to achieve the same result, you have two choices: stall, or accept the previously accepted value

simon
2016-04-21 13:20
yes

ghaskins
2016-04-21 13:21
for the former, you need to retrieve that state

ghaskins
2016-04-21 13:21
sorry, for the latter

simon
2016-04-21 13:21
well or you just apply the delta

ghaskins
2016-04-21 13:21
applying a delta is an implementation detail

ghaskins
2016-04-21 13:22
the logical operation is accepting the committed state

simon
2016-04-21 13:22
i guess that depends on how you see it

simon
2016-04-21 13:23
in bitcoin, the blockchain does not talk about state, but only about transactions (deltas)

simon
2016-04-21 13:23
and the state is purely local and ephemeral, and only for speed

simon
2016-04-21 13:24
in fabric, it is more databasey

ghaskins
2016-04-21 13:24
i think its largely semantics

simon
2016-04-21 13:25
yes

ghaskins
2016-04-21 13:25
you could say “the state of the system at blockheight X” which is the aggregate of all mutations before it

ghaskins
2016-04-21 13:25
thats all I am really referring to

ghaskins
2016-04-21 13:26
not literally a chaincode Get/PutState per se

simon
2016-04-21 14:30
jyellick: around?

jyellick
2016-04-21 14:30
Yes

simon
2016-04-21 14:30
so i'm trying to add persistence

simon
2016-04-21 14:31
i can add the sequence number to the consensus metadata

simon
2016-04-21 14:31
but what happens to the checkpoints?

simon
2016-04-21 14:31
i guess i can persist the checkpoint info

simon
2016-04-21 14:32
elsewhere

simon
2016-04-21 14:32
but it seems so odd to use checkpoints when we have a blockchain

jyellick
2016-04-21 14:32
This is in the executor-less branch?

simon
2016-04-21 14:32
yes

jyellick
2016-04-21 14:33
I don't think you can add sequence number to the consensus metadata, this is how things used to work, but had to be undone in order to support sieve, which wanted the 'checkpoint value' (block hash) at speculative execution time, before final ordering.

simon
2016-04-21 14:34
right

simon
2016-04-21 14:34
i think it will work

jyellick
2016-04-21 14:35
But the consensus-metadata is included in the blockhash? How do you get around that?

simon
2016-04-21 14:35
yes, it will

simon
2016-04-21 14:36
i thought we could include the validate set

simon
2016-04-21 14:36
but that's not correct

simon
2016-04-21 14:36
honestly, i prefer to send the execute via pbft as well

simon
2016-04-21 14:36
then *all* of this goes away

jyellick
2016-04-21 14:38
Yes, I have always thought it would make far more sense to do the execution and rollback as a 4th phase or second round of PBFT, but I thought @vukolic said there was something subtly wrong with this?

simon
2016-04-21 14:39
what do you mean, do the execution as 4th phase?

simon
2016-04-21 14:40
it would be pbft(execute), outside(verify), pbft(verify-set), pbft(execute), ....

simon
2016-04-21 14:40
and verify-set and execute could be sent in one single pbft request

jyellick
2016-04-21 14:44
Yes, so what you describe there would be basically 2 rounds?

jyellick
2016-04-21 14:45
As far as a fourth phase, trying to think it out on the fly here, and running into troubles, but essentially to send something like a checkpoint out every round, which then is used to make a decision as to whether to commit, state transfer, or rollback.

jyellick
2016-04-21 14:46
I think the 2 ~phase~ rounds approach is more obviously correct, and I think would make Sieve a much simpler plugin.

jyellick
2016-04-21 14:47
It seems like it would even potentially return some of the windowing properties to Sieve which the current approach loses.

simon
2016-04-21 14:47
two rounds, yes

simon
2016-04-21 14:47
but strictly if you have enough requests queued, you can combine both rounds into one

jyellick
2016-04-21 14:47
Ah, sure,

simon
2016-04-21 14:48
you're basically saying "this is the result of the last execution, and that's the next execution"

simon
2016-04-21 14:48
then you're getting rid of race conditions between delivery of exec and other pbft messages
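
The combined round ("this is the result of the last execution, and that's the next execution") could be sketched as a single ordered message; these types are illustrative only, not the plugin's actual wire format:

```go
package main

import "fmt"

// VerifySet is the agreed result of the previous execution round.
type VerifySet struct {
	Seq  uint64
	Hash string
}

// Request is what would be ordered through pbft: it both finalizes
// round Seq (via Verify) and proposes the next batch to execute,
// collapsing the two rounds into one when enough requests are queued.
type Request struct {
	Verify VerifySet
	Batch  []string
}

// NextRequest builds the combined message for the following round.
func NextRequest(prev VerifySet, queued []string) Request {
	return Request{Verify: prev, Batch: queued}
}

func main() {
	prev := VerifySet{Seq: 41, Hash: "h41"}
	req := NextRequest(prev, []string{"tx1", "tx2"})
	fmt.Println(req.Verify.Seq, len(req.Batch)) // 41 2
}
```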

jyellick
2016-04-21 14:48
Right, yes, I like this approach.

simon
2016-04-21 14:49
yea i suggested that long ago, but for some reason we decided not to implement it that way

simon
2016-04-21 14:49
maybe just because "we have it the other way right now"

jyellick
2016-04-21 14:53
Is this something you're going to implement now?

jyellick
2016-04-21 14:55
I guess the comment from @chetsky worries me a bit, since Sieve will not really be relevant in an MVCC world, I hate for us to spend too many cycles on it. Do you know what the requirements around Sieve are for this first release?

simon
2016-04-21 14:56
until mvcc arrives, i think we need sieve

simon
2016-04-21 14:56
i'd be the first to throw it out

simon
2016-04-21 14:56
if we didn't have to deal with non-determinism

simon
2016-04-21 14:58
ok, i'll head outside for a bit and join you in the scrum

jyellick
2016-04-21 14:58
Maybe this is something to bring up to Sharon, is tolerating non-determinism but losing replay-ability okay? I think that was Chet's big dismissal

tuand
2016-04-21 14:59
bring up to binh/marko/chet ?

kostas
2016-04-21 15:21
in https://github.com/hyperledger/fabric/issues/925 have we agreed on what the acceptable behavior should be?

kostas
2016-04-21 15:22
you got f=2 so all bets are off

kostas
2016-04-21 15:22
but I am guessing that since we are keeping it alive and assigning it, we have a take on the expected behavior

chetsky
2016-04-21 15:58
sorry guys, back (too many calls, kill me): crash-fault-tolerance is non-negotiable. Frankly, "tolerating non-determinism"? I'd punt on it, if it's a choice. B/c we know two different ways of fixing it for good

simon
2016-04-21 16:02
chetsky: binh said it is important

harshal
2016-04-21 16:29
has joined #fabric-consensus-dev

simon
2016-04-21 17:46
okay, i can now restore persisted state (except for requests)

simon
2016-04-21 17:46
still needs to be tested with the outer plugins, but I don't see any problems there

simon
2016-04-21 17:46
i can stop a primary and restart it, and everything proceeds as normal

chetsky
2016-04-21 18:15
@simon: more important than crash-fault-tolerance? really? *grin*

chetsky
2016-04-21 18:15
crash-fault-tolerance, and being able to catch-up lagging peers, is non-negotiable

chetsky
2016-04-21 18:15
sieve? -whatever-

jeffgarratt
2016-04-22 04:41
has joined #fabric-consensus-dev

simon
2016-04-22 07:07
chetsky: absolutely more important.

simon
2016-04-22 07:08
chetsky: one non-deterministic transaction -> whole network broken

simon
2016-04-22 07:08
chetsky: one node crashes -> restart from 0 using state transfer

simon
2016-04-22 14:06
i have an odd race condition in the mock network code

simon
2016-04-22 14:06
with isBusy

simon
2016-04-22 14:06
somehow the goroutine executing is not considered as busy

simon
2016-04-22 14:06
but why?

jyellick
2016-04-22 14:07
Isn't your check based on a timer's activity? I did not understand that

simon
2016-04-22 14:08
oh

simon
2016-04-22 14:09
yes, timer active -> still busy

simon
2016-04-22 14:09
once everything is event driven, it would just be len(events) > 0
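The busy check being debated could be sketched like this (types simplified, names illustrative): the replica counts as busy while its timer is running ("timer active -> still busy") or while an execution is in flight, and both fields are read under the instance lock, which is why isBusy was changed to take the lock.

```go
package main

import (
	"fmt"
	"sync"
)

// core is a minimal stand-in for the pbft instance state.
type core struct {
	mu          sync.Mutex
	timerActive bool    // set when the request timer starts, cleared on stop
	currentExec *uint64 // non-nil while an execution is outstanding
}

// isBusy checks both conditions atomically under the lock, so there is no
// window where the timer is stopped but currentExec is not yet set.
func (c *core) isBusy() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.timerActive || c.currentExec != nil
}

// startExec stops the timer and marks the execution in flight under one
// lock acquisition, closing the race discussed above.
func (c *core) startExec(n uint64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.timerActive = false
	c.currentExec = &n
}

func main() {
	c := &core{timerActive: true}
	fmt.Println(c.isBusy()) // timer running
	c.startExec(5)
	fmt.Println(c.isBusy()) // execution in flight
}
```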


jyellick
2016-04-22 14:10
The timer is being stopped in the execute, but certainly before the core is necessarily idle

simon
2016-04-22 14:10
of course execution being asynchronous, there is still some issues, because this essentially is an unfinished rpc - we expect an event to arrive later

simon
2016-04-22 14:10
yes, and I set currentExec

simon
2016-04-22 14:11
isbusy checks for both

simon
2016-04-22 14:11
that's why I don't get why it can race

jyellick
2016-04-22 14:12
But it's not set atomically? `currentExec` is not set until 3 lines after the timer stops

jyellick
2016-04-22 14:12
(unless it's already set for some reason, I don't entirely understand this flow)

simon
2016-04-22 14:13
oh sorry

simon
2016-04-22 14:13
this all happens with the lock taken

simon
2016-04-22 14:13
i changed isBusy to take the lock

simon
2016-04-22 14:14
what don't you understand about it?

simon
2016-04-22 14:14
i should rewrite it so that it is more clear

jyellick
2016-04-22 14:15
I don't think it's necessarily fair to say it's unclear, I've only read through your changeset maybe twice, and it is fairly large, just have not had time to understand it yet

simon
2016-04-22 14:15
yea :confused:

simon
2016-04-22 14:16
should add more comments then

cca
2016-04-22 15:59
hi, wanted to chat about what is "the blockchain"

cca
2016-04-22 16:00
so, regarding the discussion to split endorsement from consensus

cca
2016-04-22 16:00
there are "endorsers" per chaincode and a consensus service (done by consenter nodes)

cca
2016-04-22 16:02
now the output from the endorsers is a tx with its ops on state (read/write) and the versions in which these take place; basically the writes & reads should only be appended to the ledger when the ledger still has the same versions as when the endorser saw it

cca
2016-04-22 16:02
then this is sent through the ordering service implemented by the consensus service

cca
2016-04-22 16:03
what comes out there are ordered and "valid" (according to the endorsers) tx and their ensuing state changes

cca
2016-04-22 16:04
any peer can now consume this and will follow the recipe: it appends them to the ledger and applies the state changes when the endorsing sigs are valid according to policy and the versions match the expected ones

cca
2016-04-22 16:05
The peer can then update its hash chain, by computing root = H(previous root, tx descr, ops, versions)
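The hash-chain update above, root = H(previous root, tx descr, ops, versions), can be sketched directly; the concatenation-based serialization here is illustrative, and a real implementation would use an unambiguous encoding:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// nextRoot folds the transaction description, its read/write ops, and their
// versions into the running chain root.
func nextRoot(prevRoot, txDescr, ops, versions []byte) []byte {
	h := sha256.New()
	h.Write(prevRoot)
	h.Write(txDescr)
	h.Write(ops)
	h.Write(versions)
	return h.Sum(nil)
}

func main() {
	r0 := []byte{} // genesis: empty previous root
	r1 := nextRoot(r0, []byte("tx1"), []byte("w:a=1"), []byte("a@0"))
	r2 := nextRoot(r1, []byte("tx2"), []byte("w:a=2"), []byte("a@1"))
	fmt.Println(hex.EncodeToString(r2))
}
```

Since each root commits to the previous one, verifying everything back from a single trusted hash (as discussed below) is just replaying this computation.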

kostas
2016-04-22 16:07
correct

kostas
2016-04-22 16:08
do we agree that this hash chaining cannot happen anywhere else?

cca
2016-04-22 16:09
it is not possible to get it elsewhere because: if there is to be only one chain, then this must be the chain "consented on", that is,

cca
2016-04-22 16:09
the one that comes out of consensus

cca
2016-04-22 16:09
in that sense, yes, i agree

kostas
2016-04-22 16:10
so going back to the emails I sent, would you say we're in agreement? or is there something that's missing

cca
2016-04-22 16:10
one sec

cca
2016-04-22 16:11
yes, we agree. to find out what is the "correct" blockchain, i go to a peer that I trust and ask for the hash

cca
2016-04-22 16:11
then i can verify everything back from there, from this hash.

cca
2016-04-22 16:12
this also ensures the property that anyone else who goes to "their" peer sees the same ledger, because that peer will give the same hash or an extension of it

cca
2016-04-22 16:13
the other feature of blockchain should be that from the info on the blockchain (= hashed) one can recreate the current state. the above info allows this, assuming that also the deployed tx (their source) are included

kostas
2016-04-22 16:13
and this peer in turn can find out if their blockchain is right, by querying the endorsers for the blockchain hash on the topmost block (and hoping for a quorum) - correct?

cca
2016-04-22 16:13
ah, no, the endorsers are now out

cca
2016-04-22 16:14
the endorsers would not know the topmost block better than any node that listens to the output of the (abstract) consensus service

cca
2016-04-22 16:14
if the consensus service is implemented by BFT, then the peer will go to 2f+1 and ask for their current block
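The arithmetic behind "go to 2f+1 and ask" is the standard BFT bound: with N replicas, f = floor((N-1)/3) faults are tolerated, and querying 2f+1 replicas guarantees answers from at least f+1 correct ones, so any value reported f+1 times is vouched for by an honest replica. A minimal sketch:

```go
package main

import "fmt"

// maxFaults returns the number of byzantine replicas tolerated by N replicas.
func maxFaults(n int) int { return (n - 1) / 3 }

// readQuorum returns how many replicas a reader must ask to be sure of
// hearing from at least f+1 correct ones.
func readQuorum(n int) int { return 2*maxFaults(n) + 1 }

func main() {
	for _, n := range []int{4, 7, 10} {
		fmt.Printf("N=%d tolerates f=%d, ask %d replicas\n", n, maxFaults(n), readQuorum(n))
	}
}
```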

kostas
2016-04-22 16:14
that is not correct, because when you ask the endorsers, you say give me the "blockchain hash of block 12" (assuming 12 is *your* top-most block)

kostas
2016-04-22 16:15
and the endorsers maintain state and can of course return that info

cca
2016-04-22 16:15
aha, i see ...

cca
2016-04-22 16:16
but an endorser would not play a special role here, for returning a block hash or storing it. any peer would do that, not only endorsers

kostas
2016-04-22 16:16
yeah, that is a valid point

kostas
2016-04-22 16:16
you don't have to focus just on the endorsers

cca
2016-04-22 16:16
i said, the endorser does not know *better*

cca
2016-04-22 16:17
ok, thanks for the discussion. i can turn this over into the design docs

kostas
2016-04-22 16:17
sure thing. if you replace "endorsers" with "peers" in my last emails then everything should be exactly the same.

cca
2016-04-22 16:17
ok, thanks

simon
2016-04-22 16:33
cca: i would prefer that consensus happens on complete blocks

simon
2016-04-22 16:34
and ideally that during consensus incorrect blocks are rejected

simon
2016-04-22 16:34
so whatever happens to come out of consensus is the blockchain, without having to apply some special logic to filter out transactions/blocks

jamie.steiner
2016-04-23 04:20
has joined #fabric-consensus-dev

nits7sid
2016-04-23 14:26
has joined #fabric-consensus-dev

latone
2016-04-24 00:40
has joined #fabric-consensus-dev

howardwu
2016-04-25 03:38
has joined #fabric-consensus-dev

nits7sid
2016-04-25 08:54
Are there any restrictions on block size? Like in the Bitcoin blockchain there is 1 MB per block and it handles only ~7 transactions per second... what about in the case of OBC?

simon
2016-04-25 10:05
you can configure how many transactions go into a block

wimtobback
2016-04-25 11:38
has joined #fabric-consensus-dev

nits7sid
2016-04-25 12:11
and also the timeout ?

simon
2016-04-25 12:20
yes

nits7sid
2016-04-25 12:42
what happens when a peer on which chaincode is deployed goes down?

simon
2016-04-25 12:43
crashes, you mean?

nits7sid
2016-04-25 12:43
Yes

simon
2016-04-25 12:44
then everything proceeds as normal

simon
2016-04-25 12:44
unless more than f replicas don't answer

simon
2016-04-25 12:44
then the network stops processing requests until there are 2f+1 replicas operating correctly again

nits7sid
2016-04-25 13:01
Ohh okay!

nits7sid
2016-04-25 13:06
In the case of obcpbft/config.yaml, what does the N value signify?

mcrafols
2016-04-25 13:06
number of peers in the network

jyellick
2016-04-25 14:27
@simon: Are you close to merging your crash fault tolerance work? Was going to rebase my single-threading work onto it if so

simon
2016-04-25 14:27
oh, i didn't see that work

simon
2016-04-25 14:27
am i tracking the wrong repo?

simon
2016-04-25 14:27
yes, i'm close

jyellick
2016-04-25 14:27
No, I just have done it in a few different branches and not pushed anything

simon
2016-04-25 14:28
ah

simon
2016-04-25 14:28
i'm having problems with state transfer

simon
2016-04-25 14:28
i fixed a couple of bugs

simon
2016-04-25 14:28
now trying again

jyellick
2016-04-25 14:29
https://github.com/corecode/fabric/tree/revert-executor <- the correct branch to rebase onto?

simon
2016-04-25 14:29
i just pushed a rebased version

simon
2016-04-25 14:29
because of jeff's changes

simon
2016-04-25 14:30
could you push your current changes?

jyellick
2016-04-25 14:34
I will try to get them pushed soon, rebasing may take me some time. In your code I see this:
```
go func() {
	instance.consumer.execute(idx.n, req.Payload)
	logger.Info("Replica %d finished execution %d, trying next", instance.id, idx.n)
	instance.lock()
	defer instance.unlock()
	instance.execDone()
}()
```

jyellick
2016-04-25 14:34
What prevents multiple of those goroutines from being started at the same time? And if nothing does, what guarantees they execute in order?

simon
2016-04-25 14:35
executeOne() will only run again if currentExec is nil

simon
2016-04-25 14:35
which is set to nil by execDone()

simon
2016-04-25 14:35
it's all an ad hoc state machine - i'd prefer if all of this was formalized using a declarative state machine
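The serialization argument just described can be sketched as follows (types simplified; the WaitGroup exists only so callers can wait for the queue to drain): executeOne starts a new execution goroutine only when currentExec is nil, and the goroutine's last act under the lock clears it and tries the next request, so executions never overlap and run in queue order.

```go
package main

import (
	"fmt"
	"sync"
)

type executor struct {
	mu          sync.Mutex
	currentExec *int
	queue       []int
	executed    []int
	wg          sync.WaitGroup
}

// request enqueues a sequence number and kicks the state machine.
func (e *executor) request(n int) {
	e.mu.Lock()
	e.queue = append(e.queue, n)
	e.wg.Add(1)
	e.executeOneLocked()
	e.mu.Unlock()
}

// executeOneLocked must be called with e.mu held. It is a no-op while an
// execution is in flight, which is what serializes the goroutines.
func (e *executor) executeOneLocked() {
	if e.currentExec != nil || len(e.queue) == 0 {
		return
	}
	n := e.queue[0]
	e.queue = e.queue[1:]
	e.currentExec = &n
	go func() {
		// ... the actual consumer.execute(n, payload) would run here ...
		e.mu.Lock()
		e.executed = append(e.executed, n)
		e.currentExec = nil  // execDone: clear the guard
		e.executeOneLocked() // and try the next request
		e.mu.Unlock()
		e.wg.Done()
	}()
}

func main() {
	e := &executor{}
	for i := 1; i <= 3; i++ {
		e.request(i)
	}
	e.wg.Wait()
	fmt.Println(e.executed) // prints [1 2 3]
}
```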

jyellick
2016-04-25 14:36
Okay, thanks

simon
2016-04-25 14:36
do you have a suggestion how to make this more explicit and less open-coded?

jyellick
2016-04-25 14:41
Not off the top of my head, we just had that outstanding bug around dropping the lock before executions which was causing our executions to occur out of order

simon
2016-04-25 14:41
i guess this could go into a mini-executor

simon
2016-04-25 14:41
yea

jyellick
2016-04-25 14:43
Yes, it might be worth factoring out, `pbft-core.go` is complex enough as it is.

simon
2016-04-25 14:43
right

simon
2016-04-25 14:44
doesn't have to go into a separate package, but a separate "object" may be a good idea

simon
2016-04-25 14:44
oh I realized that many patterns work well with embedding

simon
2016-04-25 14:44
I didn't know enough go early on to use embedding effectively

simon
2016-04-25 14:46
e.g. the omniproto could be used as a base for embedding

simon
2016-04-25 14:46
instead of using these "impl" function pointers

simon
2016-04-25 14:50
```14:47:22.035 [consensus/obcpbft] restoreState -> INFO 036 Replica 3 restored state: view: 0, seqNo: 31, lastExec: 0, pset: 1, qset: 0, reqs: 31```

simon
2016-04-25 14:50
why lastExec 0?

simon
2016-04-25 14:51
and then this: ```14:47:39.387 [consensus/statetransfer] tryOverPeers -> WARN 18a name:"vp3" in tryOverPeers loop trying name:"vp0" : name:"vp3" got block 37 from name:"vp0" with hash 81e3da4e5fe0d18f2a7fd662e8e982e1e7ab5ab5a36e73587f97b772f0546370424a6b7eee8f60304d37faf07171e3466506349995596e6ae7c28bbf782b5ac9, was expecting hash 763d71e4e31a313973c000ac68d0414f95ff87bc6efe9f66c662a880b81381d11d3a4b7fd18dcea25590adec5f8eee74fae9c0cc63f1e878cab4250cf9badb6b```

simon
2016-04-25 14:51
:disappointed:

simon
2016-04-25 14:53
i don't get it

simon
2016-04-25 14:54
i need an offline block explorer

simon
2016-04-25 14:56
DUH

simon
2016-04-25 14:56
i didn't update batch yet

simon
2016-04-25 15:25
wrong hash, why

simon
2016-04-25 15:26
how do i even start to debug this

simon
2016-04-25 15:26
dreams of distributed gdb

jyellick
2016-04-25 15:30
I usually find that things are an off by one error, try grep-ing for 'wrong' hash in the logs? You might find it belongs to an adjacent block?

simon
2016-04-25 15:30
yea i dunno

simon
2016-04-25 15:30
trying delve now

nits7sid
2016-04-25 15:49
Am I correct on this? The peer on which I deploy a chaincode becomes the primary replica?

simon
2016-04-25 15:54
no

simon
2016-04-25 15:54
all peers will deploy the chaincode

simon
2016-04-25 15:55
the primary is just part of PBFT, and changes during network problems or incorrect behavior of the primary

simon
2016-04-25 15:55
all validating peers are exactly the same

nits7sid
2016-04-25 15:57
ohh... what is the significance of N="4" peers in the network? and also the .yaml has 9 test VPs listed.. can you please explain their importance?

simon
2016-04-25 15:57
N=4 means that there are 4 peers in the network

simon
2016-04-25 15:58
i don't know which yaml lists 9 test VPs

nits7sid
2016-04-25 15:59
membersrvc.yaml

simon
2016-04-25 16:00
oh i don't know about membersvc

tuand
2016-04-25 16:00
membersrvc.yaml lists userids that you can use when developing/testing fabric ... not meant to map to the actual number of peers

nits7sid
2016-04-25 16:03
ohh so basically in dev-net environment i can start max 4 VP's

simon
2016-04-25 16:04
i have no idea about dev-net - i'd start a single one

simon
2016-04-25 16:04
passed!

simon
2016-04-25 16:04
jyellick: right on, off-by-one regarding block height

tuand
2016-04-25 16:06
right, devnet shows an example. You can start N > 4 peers if you wish

tuand
2016-04-25 16:07
N=4 just happens to show how PBFT works if you allow 1 peer to go byzantine/stop working

simon
2016-04-25 16:07
nits7sid: do you want to work on consensus code, or on chaincode?

nits7sid
2016-04-25 16:13
Well i actually want to work on chaincode..But i had some queries on consensus which got solved..Thanks to @simon and @tuand

simon
2016-04-25 16:13
jyellick: what I pushed works for batch and classic. working on sieve now

jyellick
2016-04-25 17:12
@simon: I still don't understand how your branch handles state transfer, what happens if you witness a weak checkpoint for seqNo 10, and then one for seqNo 20, passing them both to state transfer, how do you know which seqNo state transfer completes to?

simon
2016-04-25 17:12
yea, i just added a XXX for that

simon
2016-04-25 17:13
it will have to retrieve the lastExec seqno from the block

jyellick
2016-04-25 17:13
So seqNo is now in the consensusMetadata in the block?

simon
2016-04-25 17:14
yes

jyellick
2016-04-25 17:15
How do you handle this for Sieve? (It used to be in the block, but we removed it explicitly to support Sieve)

simon
2016-04-25 17:19
sieve uses the block height

simon
2016-04-25 17:19
from the blockchain

jyellick
2016-04-25 17:23
But Sieve could still have null requests? And this would cause the seqNo and block number to diverge?

jyellick
2016-04-25 17:24
I guess what I am getting at, is that Sieve will need to call into `pbft-core.go` to update the `lastExec` properly, and the block number does not help us there?

simon
2016-04-25 17:26
sieve does not have null requests

simon
2016-04-25 17:27
yes, any state transfer needs to call (eventually) into pbft so that it can update lastexec

jyellick
2016-04-25 17:30
Why does Sieve rule out null requests? What if a byzantine primary skips a sequence number when it sends the pre-prepare, this will lead to a view change, and force the next primary to send a null request. I didn't notice any inspection of the PBFT message to ensure that primary does not do this

simon
2016-04-25 17:30
ah, that's a pbft request

simon
2016-04-25 17:31
i thought you meant sieve requests

jyellick
2016-04-25 17:31
Ah, no, certainly Sieve does not allow null requests

simon
2016-04-25 17:31
pbft null requests are fine, sieve itself uses the blockchain height for its blocknumber

jyellick
2016-04-25 17:32
But at the completion of state transfer, how does Sieve know which PBFT sequence number the PBFT core has now `lastExec`-ed?

simon
2016-04-25 17:32
pbft restores its state from the head block

simon
2016-04-25 17:32
like it had just been restarted

simon
2016-04-25 17:33
from the consensus metadata

jyellick
2016-04-25 17:34
But if Sieve is consenting on the block hash, and the block contains the consensus metadata, which contains the PBFT sequence number, then how does Sieve pick the right info, prior to PBFT ordering?

simon
2016-04-25 17:35
i'm just using the last verify exec number

simon
2016-04-25 17:35
it's not nice, but it works

simon
2016-04-25 17:35
it's functionally the same as if verify+exec were in one double message

jyellick
2016-04-25 17:38
But `Verify` is a Sieve message, not a PBFT one? (it doesn't have a sequence number?)

jyellick
2016-04-25 17:38
(Sorry if I am being dense)

simon
2016-04-25 17:54
yes, but it comes through pbft

simon
2016-04-25 17:54
and therefore provides a seqno that can be used to restore to

jyellick
2016-04-25 18:17
It looks like the `Verify` is broadcast directly, but the `VerifySet` comes through as a pbft message, which would have a sequence number, so I'm guessing that's what you were talking about. So block `n` would contain the consensus metadata that actually went into block `n-1`? How then do you actually recover the sequence number which corresponds to a state transfer to block `n`?

simon
2016-04-25 18:19
ah yes, sorry

simon
2016-04-25 18:19
verifyset

simon
2016-04-25 18:20
you're just one off

jyellick
2016-04-25 18:20
But you don't know if there was a null request? You could be 20 off?

simon
2016-04-25 18:36
yes, you could be

simon
2016-04-25 18:36
but if they execute, all is fine

simon
2016-04-25 18:36
because they're null requests

jyellick
2016-04-25 18:55
I guess the problem I am seeing is: vp0,1,2,3 in the network, vp0 byzantine and primary, at block 8, seqNo 8.
vp0 deliberately skips seqNo 9 and sends a pre-prepare for seqNo 10, for block 9 (which gets a commit cert).
This triggers a view change, and the new primary vp1 is forced to queue a null request and skip seqNo 9, as seqNo 10 has a commit certificate and it is for block 9.
The network now has a stable checkpoint for seqNo 10, corresponding to block 9, which has consensus metadata of seqNo 8.
vp3 rejoins the network after a crash; vp0 decides now to ignore all requests.
Eventually, because of the view change, vp3 gets the last stable checkpoint to transfer to, which corresponds to block 9 (seqNo 10), so it performs a state transfer to this point, and sets its seqNo to 8+1 (according to the consensus metadata).
Its sequence number is now out of sync, and the network will never make progress, as byzantine vp0 has effectively tricked vp3 into a bad state (and now f=2).

simon
2016-04-25 18:56
which seqno is set?

simon
2016-04-25 18:57
lastexec?

jyellick
2016-04-25 18:57
Yes

jyellick
2016-04-25 18:57
(the above is for k=10 as our defaults, if it was not clear)

simon
2016-04-25 18:58
sorry, i don't understand

simon
2016-04-25 18:59
can you write a step by step sequence?

simon
2016-04-25 19:00
i think you're saying that because state transfer catches up to seqno 10, so should lastexec?

jyellick
2016-04-25 19:02
Is it alright if I start from: The network now has a stable checkpoint for seqNo 10, corresponding to block 9, which has consensus metadata of seqNo 8. ?

simon
2016-04-25 19:02
yes

simon
2016-04-25 19:02
i think i know what you mean

simon
2016-04-25 19:03
so state transfer needs to get the tag to play forward to, and then pass this tag to the callback

simon
2016-04-25 19:03
but will that solve all issues?

jyellick
2016-04-25 19:04
Yes, so that is why state transfer got enhanced to take that interface, which it would pass back to the caller so that it knew which point had been transferred to.

simon
2016-04-25 19:05
right - that will be easy

simon
2016-04-25 19:06
what if sieve needs to sync?

jyellick
2016-04-25 19:07
Yes, this is exactly why the executor had that wrapping pattern for checkpoint IDs. The executor would generate an ID which embedded the block hash and sequence number. Then, Sieve would wrap that ID with its own block number information, before passing it into pbft-core as the checkpoint
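The wrapping pattern being described could be sketched like this, with illustrative types: the executor's checkpoint ID embeds the block hash and executor sequence number, Sieve wraps it with its own block number before handing it to pbft-core, and on state-transfer completion each layer unwraps only its own part, so every component can restore its own notion of progress.

```go
package main

import "fmt"

// executorID is the innermost checkpoint identity: which block the executor
// produced and at which sequence number.
type executorID struct {
	blockHash string
	seqNo     uint64
}

// sieveID wraps the executor's ID with Sieve's own block number.
type sieveID struct {
	blockNumber uint64
	inner       executorID
}

func wrap(inner executorID, blockNumber uint64) sieveID {
	return sieveID{blockNumber: blockNumber, inner: inner}
}

// unwrap peels off Sieve's layer, handing the inner ID back to the executor.
func (s sieveID) unwrap() (uint64, executorID) {
	return s.blockNumber, s.inner
}

func main() {
	id := wrap(executorID{blockHash: "abc123", seqNo: 10}, 9)
	bn, ex := id.unwrap()
	fmt.Println(bn, ex.seqNo, ex.blockHash)
}
```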

jyellick
2016-04-25 19:10
That was the big point of the executor: eliminating the sync logic from PBFT and the plugins. On sync completion only the executor had to manipulate itself; the core and plugins managed their own state and the initiation of state transfer, so the callbacks were not needed.

jyellick
2016-04-25 19:11
Otherwise, there will need to be some other sort of cooperation mechanism between pbft and plugin to ensure that everything both pieces need for that callback gets embedded

simon
2016-04-25 19:12
probably not needed, because sieve can restore its state from the blockchain

simon
2016-04-25 19:13
it would be nice if pbft could do the same

jyellick
2016-04-25 19:13
Agreed, but Sieve seems to make that impossible, as we need to compute the block before PBFT has been invoked, so we can't embed the PBFT state in there.

simon
2016-04-25 19:13
we can, just an earlier state

simon
2016-04-25 19:14
which sieve can deal with

jyellick
2016-04-25 19:14
But then we have the problem I just described, where the PBFT state changes in an unlikely way (such as null requests)

simon
2016-04-25 19:15
hmm i may have made an incorrect change anyways

simon
2016-04-25 19:15
stopping timer on commit certificate reception

simon
2016-04-25 19:15
not on execute

simon
2016-04-25 19:15
but we can't stop on execute, because they take forever (deploy)

simon
2016-04-25 19:15
so i don't know

simon
2016-04-25 20:26
hm yes

simon
2016-04-25 20:27
jyellick: indeed. it already doesn't work for the behave test. sieve retrieves an older exec and then pbft can't commit

simon
2016-04-25 20:27
i'm pondering a nasty hack

jyellick
2016-04-25 20:28
What sort of hack?

jyellick
2016-04-25 20:28
(other than nasty, of course)

simon
2016-04-25 20:34
i thought about adding 1 to the seqno stored by sieve

simon
2016-04-25 20:34
but i'll try the tag first

simon
2016-04-25 20:34
it won't help with a replica restarting though

simon
2016-04-25 20:34
that's a problem

simon
2016-04-25 20:34
imagine all replicas crash except for one

simon
2016-04-25 20:34
and restart

simon
2016-04-25 20:34
they will be one behind and won't ever catch up

simon
2016-04-25 20:35
my at&t vpn doesn't work

jyellick
2016-04-25 20:39
Yes, the crash scenario remains a problem.

jyellick
2016-04-25 20:39
(for what it's worth, VPN here seems fine)

jyellick
2016-04-25 20:41
So here is a question for you @simon, I have pbft-core doing everything on a single thread, driven by channels now. In ordinary operation, it sits in a select waiting for one of those channels to deliver a message, and then does the work. The problem this causes with all of our existing test cases is that `process()` assumes the message is consumed on the thread which sends it, so it immediately goes to check if things are idle, and they are, because the thread in the select hasn't had a chance to do anything.

jyellick
2016-04-25 20:42
I initially added yet another boolean that could be checked, but it is a race. There is the idle-channel type pattern which I used before, but I'm not crazy about it, and I saw you had removed it from `process()`; how would you suggest I get `process()` to block?
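One possible shape for this, sketching the idle-channel idea rather than the actual fabric test code: have `process()` enqueue a marker event whose only effect is to signal back. Because the loop handles events in order, receiving the signal means every earlier event has been handled, with no racy boolean involved.

```go
package main

import "fmt"

// loop runs all work on a single goroutine fed by a channel, mirroring the
// single-threaded pbft-core described above.
type loop struct {
	events chan func()
}

func newLoop() *loop {
	l := &loop{events: make(chan func(), 64)}
	go func() {
		for ev := range l.events {
			ev()
		}
	}()
	return l
}

func (l *loop) send(ev func()) { l.events <- ev }

// process blocks until all previously sent events have been handled: the
// marker closure runs only after everything enqueued before it.
func (l *loop) process() {
	idle := make(chan struct{})
	l.events <- func() { close(idle) }
	<-idle
}

func main() {
	l := newLoop()
	count := 0
	for i := 0; i < 3; i++ {
		l.send(func() { count++ })
	}
	l.process() // blocks until the three increments have run
	fmt.Println(count)
}
```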

fabio
2016-04-25 20:54
has joined #fabric-consensus-dev

jyellick
2016-04-25 22:02
@simon: Everything's not passing yet, but if you are eager to take a look, you can see my issue-973 branch

simon
2016-04-25 22:08
cool

simon
2016-04-25 22:08
it's past midnight here

simon
2016-04-25 22:08
so i'll have a look tomorrow

simon
2016-04-25 22:15
```22:03:57.884 [consensus/obcpbft] executeVerifySet -> INFO 64c Decision successful, but our output does not match (0826124039f938d3f299e917e859e9611c4d31d41336be61842cdd3ef944e38fa201ab8d00bcc5143c602794a1ae1d33495c7a489931ee6e2741787f4afb9f2ff001ba1a1a400b70443534fd04f0e73aa20148edd33f6e5672d311bb37f0718f1f5a156ff8c6bc2c5f033158e2a2178c73c85b2934954ab73bb84f66e257ed8c41670247fffc) vs (082612404c63aaba876263c7f6fbd635ab1fce88ede426806f8e8d601d4eb50b122f6f9b9eccbe0923fb881ec000bc263a4614f17067f677e990f5aa0c7221f35fdffc931a400b70443534fd04f0e73aa20148edd33f6e5672d311bb37f0718f1f5a156ff8c6bc2c5f033158e2a2178c73c85b2934954ab73bb84f66e257ed8c41670247fffc) ```

simon
2016-04-25 22:15
hmmm

simon
2016-04-25 22:15
why?

simon
2016-04-25 22:18
this is going to be so much pain

simon
2016-04-25 22:18
to make this work reliably

simon
2016-04-25 22:20
i think exec needs to go into pbft

simon
2016-04-25 22:20
but there was some reason marko told me

simon
2016-04-25 22:22
if we just can treat the sieve commits as checkpoints

simon
2016-04-25 22:22
that would be great

simon
2016-04-25 22:23
they effectively are

simon
2016-04-25 22:23
they even carry the previous block hash...

simon
2016-04-25 22:24
which means that we should be able to seek to every block

simon
2016-04-25 22:24
i need to move the execDone out to make it completely asynchronous

simon
2016-04-26 11:20
hi

simon
2016-04-26 12:04
@jyellick: you around?

simon
2016-04-26 12:04
i'm trying to understand `TestViewChangeWatermarksMovement` - and why it worked without panic

simon
2016-04-26 12:30
@jyellick: i think `b47c4c3` is wrong

simon
2016-04-26 12:30
i don't think we need to initiate a state transfer when lastExec is lagging behind

simon
2016-04-26 12:30
or is that related to the executor

simon
2016-04-26 12:31
probably it is

jyellick
2016-04-26 12:36
@simon: Let me take a look

jyellick
2016-04-26 12:40
So, we are picking a new checkpoint to begin executing from. I suppose if we believe we have all commit certificates leading up to that checkpoint, then we could simply attempt to execute them, but it should be safe to move our watermarks and perform state transfer. By design the view change guarantees we pick a checkpoint which has at least one non-byzantine node to replicate from.

jyellick
2016-04-26 12:43
`b47c4c3` actually made state transfer less likely to occur, we used to perform state transfer if our low watermark was below the checkpoint, even if we had already executed beyond that checkpoint, which was definitely wrong. The executor mitigated it by discarding the state transfer (as it was for a sequence number less than had already been executed to), but it still screwed up our `lastExec` by lowering it.

simon
2016-04-26 12:43
ah i see

simon
2016-04-26 12:44
what i was thinking was that we may have already have a commit certificate and could execute a request

simon
2016-04-26 12:44
but we just didn't get to it yet

simon
2016-04-26 12:44
in that case we don't need to do a state transfer either

simon
2016-04-26 12:45
the reason i'm talking about this is because suddenly that test is failing and I don't quite understand how it didn't fail before

simon
2016-04-26 12:45
specifically this triggered:
```
if !(len(msgList) == 0 && len(nv.Xset) == 0) && !reflect.DeepEqual(msgList, nv.Xset) {
	logger.Warning("Replica %d failed to verify new-view Xset: computed %+v, received %+v", instance.id, msgList, nv.Xset)
	return instance.sendViewChange()
}
```

simon
2016-04-26 12:46
of course the msgList is populated (we have at least one request per xset), and xset injected in the test is empty

simon
2016-04-26 12:46
so i'm scratching my head

jyellick
2016-04-26 12:47
`(len(msgList) == 0 && len(nv.Xset) == 0)` should be true, so the first piece should be false, and the second half should not be evaluated, and we should skip it?

jyellick
2016-04-26 12:48
I guess I'm unsure why `msgList` is non-empty

simon
2016-04-26 12:48
oh

simon
2016-04-26 12:48
because there is always at least one request in an Xset

simon
2016-04-26 12:49
at least one null request

simon
2016-04-26 12:49
or not?

simon
2016-04-26 12:50
ah yes

jyellick
2016-04-26 12:50
I would need to check the paper, really, that test is just supposed to be very specifically against the watermark movement and that bit was not being executed, so I did not worry about it

simon
2016-04-26 12:50
which is anyways our modification - original pbft fills the whole log with null requests

simon
2016-04-26 12:51
i'm just wondering how it didn't fail before :simple_smile:

simon
2016-04-26 12:51
I'm getting a panic not implemented, because it is trying to sign the view change message :simple_smile:

simon
2016-04-26 12:52
although you stubbed out the viewchangeimpl

simon
2016-04-26 12:52
so i don't know

simon
2016-04-26 12:53
ah no, that is called as response

jyellick
2016-04-26 12:55
Hmmm, so, looking at `assignSequenceNumbers` quickly, it meshes with what's in my head: if there is nothing in the P-set, because we have not prepared any requests

jyellick
2016-04-26 12:55
Then there is no need to fill in any null requests

simon
2016-04-26 12:58
no, there is always at least one request

simon
2016-04-26 12:58
(in original pbft, the whole L size is filled, with null requests if need be)
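A simplified sketch of the gap-filling being discussed around `assignSequenceNumbers` (the real function works over vSet quorum certificates; `""` stands in for a null request here): the new primary assigns something to every sequence number from the checkpoint up to the highest prepared request, inserting a null request where nothing was prepared, so even with an empty P-set one null request is assigned and something executes after the view change.

```go
package main

import "fmt"

// buildXset fills every slot between the checkpoint and the highest
// prepared sequence number, using "" as a null request for gaps. With no
// prepared requests there is still one null request at checkpoint+1.
func buildXset(checkpoint uint64, prepared map[uint64]string) map[uint64]string {
	maxN := checkpoint + 1
	for n := range prepared {
		if n > maxN {
			maxN = n
		}
	}
	xset := make(map[uint64]string)
	for n := checkpoint + 1; n <= maxN; n++ {
		if req, ok := prepared[n]; ok {
			xset[n] = req
		} else {
			xset[n] = "" // null request fills the gap
		}
	}
	return xset
}

func main() {
	fmt.Println(buildXset(10, nil))                          // one null request at 11
	fmt.Println(buildXset(10, map[uint64]string{13: "req"})) // nulls at 11 and 12
}
```

This is why the `len(msgList) == 0` half of the check quoted earlier can never be taken: the computed list always has at least one entry.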

simon
2016-04-26 12:59
that was a bug we fixed long ago (as evidenced by a viewchange timeout because nothing executed after view change)

jyellick
2016-04-26 12:59
Then that seems like a bit of a silly check on the length of `msgList`

simon
2016-04-26 12:59
indeed

simon
2016-04-26 13:01
still, how come that test didn't fail before?

jyellick
2016-04-26 13:02
Yes, looking a little harder at `assignSequenceNumbers`, it seems like `maxN` should be our checkpoint seqNo+1; since we have a quorum from the vSet, we should get 1 null request.

simon
2016-04-26 13:03
yep

jyellick
2016-04-26 13:03
But yes, I agree, I'm not sure how it didn't fail. I'd say pretty clearly something must have changed, but I didn't think we'd been doing any real mucking around in the view change code (other than this, and another fix or two)

jyellick
2016-04-26 13:03
[And obviously I wouldn't have pushed any of those if they caused a panic in our tests]

simon
2016-04-26 13:04
yea

simon
2016-04-26 13:04
weird

jyellick
2016-04-26 13:11
So do you have an opinion on how to fix `process()` with PBFT processing not on the message thread?

simon
2016-04-26 13:12
regarding your proposed patch?

jyellick
2016-04-26 13:12
Yes

jyellick
2016-04-26 13:12
Or, other feedback on said patch is also welcome

simon
2016-04-26 13:15
i'll have a look in a minute

simon
2016-04-26 13:16
what i noticed when i had a look this morning: there is no back channel when the request gets dropped

simon
2016-04-26 13:16
you call it reject, but I don't see how the rejection works

simon
2016-04-26 13:16
ideally the request wouldn't be dequeued in the first place

jyellick
2016-04-26 13:22
In `handler.go`?

jyellick
2016-04-26 13:25
If so, I moved the reply into a deferred function, so that whatever is set in `response` gets sent when the function exits. The logic before was pretty messy around setting and sending that `response` and the defer simplifies it considerably. So, in the event that a request is not queued, it sends back an error response to the sender.

jyellick
2016-04-26 13:26
Hmm, actually looks like I may have missed some.

jyellick
2016-04-26 13:27
Yes, I only do the rejection reply for the chain transactions, as that was the only place we did it before.

jyellick
2016-04-26 13:28
Would you suggest we reply with a reject for consensus messages as well? I'm not sure what the other side would do with that.

simon
2016-04-26 13:29
ah no

jyellick
2016-04-26 13:29
(And we must dequeue even if there is no space in the buffer. Otherwise we expose ourselves to the same sort of deadlock conditions which this changeset attempts to remove)

simon
2016-04-26 13:30
first, i think we agreed that there are different types of messages/events to consensus that need to be handled differently

simon
2016-04-26 13:31
consensus messages should always be acted upon, while locally generated requests may back up (so that the frontend can inform the clients about overload)

simon
2016-04-26 13:31
that implies that these different events should have different ingress routes. I never liked the generic `RecvMsg` interface

jyellick
2016-04-26 13:33
So, we don't have the promise that consensus messages are always acted upon today. In the event that consensus does not read them fast enough, the gRPC buffer backs up, and they start getting discarded. We are just trading a gRPC buffer, for one we control.

simon
2016-04-26 13:33
ah right

simon
2016-04-26 13:33
well, that is even acceptable

simon
2016-04-26 13:33
i guess

simon
2016-04-26 13:33
we just want to be able to prioritize consensus messages over new requests?

simon
2016-04-26 13:34
or rather, we want to be able to accept new local requests at our own pace (maybe we only have X outstanding requests per replica)

jyellick
2016-04-26 13:36
So, yes, local requests actually stay somewhat synchronous thus far in this changeset, as `engine.go` directly injects them via `RecvMsg` into consensus

simon
2016-04-26 13:36
right

jyellick
2016-04-26 13:36
For remote peers, their transactions go into a per peer channel

jyellick
2016-04-26 13:36
I haven't moved queries into a channel yet, but that can be done.

jyellick
2016-04-26 13:37
The original impetus for the change, and the one that the changeset thus far targets, is to make sure that consensus messages cannot indefinitely block peer messages.

simon
2016-04-26 13:37
my vision is an object with a defined set of input events, and a shell around it that ensures that this state machine runs synchronously

simon
2016-04-26 13:37
there would also be a queue of output events (broadcast, execute, start state transfer)

jyellick
2016-04-26 13:37
Where the important peer message, is state transfer.

simon
2016-04-26 13:38
then we drop the locks internally

jyellick
2016-04-26 13:39
So, if you look at `pbft-core.go` for instance. You can see that this tries to put the first bit of that in place.

jyellick
2016-04-26 13:39
There is a channel which `RecvMsg` writes into, same with the `stateUpdate`, etc.

simon
2016-04-26 13:40
oh so many channels

simon
2016-04-26 13:40
hmm

jyellick
2016-04-26 13:40
And then there is a single main thread which selects across these channels, the channel which is read from determined your 'type of input', then modifies the state machine. Obviously not as clean as you'd like, but I think it's a first step.

jyellick
2016-04-26 13:41
Channels seem to be the preferred method of Go concurrency, and the constructs like select do make using them pretty nice.

simon
2016-04-26 13:41
right

simon
2016-04-26 13:42
they have different semantics from a single queue, but okay

simon
2016-04-26 13:42
could we remove the goroutine call from newPbftCore?

simon
2016-04-26 13:42
and instead put the burden of processing on some outside entity (i.e. handler)

simon
2016-04-26 13:43
that would simplify our tests as well

simon
2016-04-26 13:43
because then there are no multiple goroutines anymore

jyellick
2016-04-26 13:43
Hmmm

jyellick
2016-04-26 13:44
So I like the idea of moving the `go` call out of `pbft-core.go`, but I don't think it works coming from the handler.

simon
2016-04-26 13:44
i would even go so far to split the dispatch out of pbftcore

simon
2016-04-26 13:44
so that we really only receive events

jyellick
2016-04-26 13:44
The problem is, we have events like the view change timer, which we need to act on, which do not originate from the handler.

simon
2016-04-26 13:44
and somebody else cares about making those events appear in sequence

simon
2016-04-26 13:45
right, so the handler will have to provide a timeout service

simon
2016-04-26 13:45
and it will inject a timeout event

simon
2016-04-26 13:45
make the consensus a completely event driven thing

jyellick
2016-04-26 13:47
I'm still not sure where the actual execution comes from, there needs to be a goroutine from somewhere. It seems like it's simply pushing the burden of listening for and serializing events, then driving execution, out of pbft and into somewhere else.

simon
2016-04-26 13:48
yes, change the encapsulation

simon
2016-04-26 13:49
i think it would make sense to have a goroutine created by the handler

simon
2016-04-26 13:49
or helper. i don't know why there are two

jyellick
2016-04-26 13:49
I don't dislike it in theory, though I still think the handler is a funny place to put it, since there is one handler per connection.

simon
2016-04-26 13:49
oh

simon
2016-04-26 13:49
right

simon
2016-04-26 13:50
you are correct

simon
2016-04-26 13:50
engine then

jyellick
2016-04-26 13:50
I think this might be a bigger piece of work than fits into this sprint though.

simon
2016-04-26 13:50
it's a goal to aspire to :simple_smile:

simon
2016-04-26 13:51
i hope my monster #1000 will be done

simon
2016-04-26 13:51
and i can go and work on some of this refactoring

jyellick
2016-04-26 13:52
FYI did you see the message about the Sieve de-emphasis?

simon
2016-04-26 13:53
yes

simon
2016-04-26 13:54
that's good

simon
2016-04-26 13:54
why isn't it triggering state transfer now :confused:

simon
2016-04-26 13:56
aha, yes

simon
2016-04-26 13:58
jyellick: so #680 bdd test

simon
2016-04-26 13:58
how does that work

jyellick
2016-04-26 13:58
Ah, so, that will probably stop failing once crash recovery works

simon
2016-04-26 13:58
we put in 30 requests

simon
2016-04-26 13:59
ahahaha

simon
2016-04-26 13:59
and i've been fixing crash recovery to get it working

simon
2016-04-26 13:59
duh :simple_smile:

jyellick
2016-04-26 13:59
Er, yes, but by failing I mean it probably won't initiate state transfer.

simon
2016-04-26 14:00
so we put in 30 requests, then stop vp3, one more request, restart vp3, then 6 more requests

simon
2016-04-26 14:01
somehow that means 37 in total, but i can see request 38 being executed

jyellick
2016-04-26 14:01
Deploy is a transaction

simon
2016-04-26 14:01
ah right

simon
2016-04-26 14:01
so 31, stop vp3, 32, start vp3, 33..38

jyellick
2016-04-26 14:01
In the past, VP would restart, thinking its seqNo was 0, so, it would witness enough checkpoints outside its watermarks, and initiate state transfer

simon
2016-04-26 14:02
K=2, L=8

jyellick
2016-04-26 14:02
So, with crash fault working, then it will not see those as outside its watermarks.

simon
2016-04-26 14:02
checkpoint is 30

jyellick
2016-04-26 14:02
Because it knows its last seqNo was 31

simon
2016-04-26 14:02
right

simon
2016-04-26 14:03
oh i just realized...

jyellick
2016-04-26 14:03
So, if you bump up the tail end of the requests from 6 additional to say, 10, I think you should get it.

simon
2016-04-26 14:03
i should move the low watermarks to the highest restored checkpoint

jyellick
2016-04-26 14:03
Yes

simon
2016-04-26 14:03
not to lastexec's previous low watermark

jyellick
2016-04-26 14:03
Right

simon
2016-04-26 14:05
in any case - need to inject 2 more, i guess?

simon
2016-04-26 14:05
so that checkpoint 40 is reached

simon
2016-04-26 14:05
which should trigger a state transfer

simon
2016-04-26 14:05
or do i wait for 42?

jyellick
2016-04-26 14:06
I think you want 4 more

simon
2016-04-26 14:06
i never understood why `weakCheckpointSetOutOfRange` stops `recvCheckpoint` processing

jyellick
2016-04-26 14:06
So, it shouldn't

jyellick
2016-04-26 14:07
There is a TODO saying that we should basically resubmit those 'out of watermark range' checkpoints, because chances are, we already have a weak checkpoint cert

jyellick
2016-04-26 14:07
But that's an optimization, not a correctness statement

simon
2016-04-26 14:07
oh, resubmit

simon
2016-04-26 14:07
because we discarded them

jyellick
2016-04-26 14:07
Right

simon
2016-04-26 14:08
because they were out of watermarks

simon
2016-04-26 14:08
hehe

jyellick
2016-04-26 14:08
Haha, yep

simon
2016-04-26 14:08
once you practically implement such a system and can't assume infinite storage for messages...

simon
2016-04-26 14:09
did you ever see how applications in erlang use state machines?

simon
2016-04-26 14:10
or in general in most functional languages

jyellick
2016-04-26 14:11
Erlang is unfortunately something I've never gotten a chance to explore, though it might be worth doing, since the whole 'actor model' message passing thing I think originates there?

simon
2016-04-26 14:12
i suppose

simon
2016-04-26 14:12
1 scenario passed, 0 failed, 26 skipped

simon
2016-04-26 14:12
woohoo!

ghaskins
2016-04-26 14:12
@jyellick: Erlang is awesome

ghaskins
2016-04-26 14:13
i don't know if they invented the actor model, but it certainly uses it

ghaskins
2016-04-26 14:13
if you want to build highly available clusters though, it makes a great backend platform

simon
2016-04-26 14:14
so what they usually do is: `fun machine(state, inputchan) { dispatch msg <- inputchan { case <pattern match 1> ....; case <pattern match 2> ...; case <pattern match N> do something with msg; machine({new state based on old state}, inputchan) } }`

simon
2016-04-26 14:15
ghaskins: imagine BFT as erlang runtime service

ghaskins
2016-04-26 14:15
i have, it would work pretty well I would imagine

simon
2016-04-26 14:15
and you send messages to other processes

simon
2016-04-26 14:15
not sure whether the queues are bounded

ghaskins
2016-04-26 14:18
i forget how it manages that

simon
2016-04-26 14:27
lol, now it even works with sieve

simon
2016-04-26 14:27
yey

simon
2016-04-26 14:28
jyellick: i just pushed my remaining commits

jyellick
2016-04-26 14:28
Great, thanks @simon

simon
2016-04-26 14:28
last behave run before i submit the pull request

simon
2016-04-26 14:29
but issue #680 passed for all 3 consensus plugins

simon
2016-04-26 14:29
it got a bit longer :confused:

simon
2016-04-26 14:29
i thought it would be maybe 10 commits

simon
2016-04-26 14:29
not 70


simon
2016-04-26 17:25
appreciate some more review

jyellick
2016-04-26 17:31
Working on finishing up my changeset, will post it here and then try to give yours some review.

stan.liberman
2016-04-26 21:28
has joined #fabric-consensus-dev

jyellick
2016-04-26 21:29
@simon: Your PR seems to cause a slight problem in the ledger tests: `core/ledger/ledger_test.go:682: ledger.GetTXBatchPreviewBlock undefined (type *Ledger has no field or method GetTXBatchPreviewBlock)`

jyellick
2016-04-27 04:25
A minor milestone. With @simon's latest changeset and the locking/threading changes from the impending 919/973 changeset on top of it, my standard vagrant environment has managed to process a 10k request flood (20 threads issuing 500 requests each as fast as they could be accepted) with no problems (no deadlock, not even any view changes)

muralisr
2016-04-27 04:27
^^ very cool

gengjh
2016-04-27 05:47
@jyellick: did you try to enable the security and privacy in your env?

simon
2016-04-27 06:58
jyellick: cool!


simon
2016-04-27 07:20

simon
2016-04-27 08:40
jyellick: i rewound my tree for the pull request - you'll have to rebase

andyz
2016-04-27 09:01
has joined #fabric-consensus-dev

simon
2016-04-27 09:11
hi andy!

cca
2016-04-27 09:19
which entity in the current system creates the notion of a "block"? is a "block" the same as a "batch" of transactions that go together through consensus?

simon
2016-04-27 09:35
DSL died :confused:

simon
2016-04-27 09:36
cca: yes, the consensus layer calls (via helper) into the ledger and chaincode layers. this forms the block in the end

simon
2016-04-27 11:36
```
13:05:19.728 [consensus/obcpbft] request -> INFO 7d8 Sieve replica 1: New consensus request received: t+IAW6yQaAgkLOaXRAECggsj05/tr+/zhORO0WuITlfhzLTBwZJZ4ytFPRDOw5fRiUC2KCV5Wt3BNLYW+elbVg==
TEST: process looping
TEST: processing message without testing for idle
TEST: new message, delivering
TEST: deliver
TEST: Sending unicast
TEST: process looping
TEST: process looping
13:05:19.728 [consensus/obcpbft] request -> INFO 7d9 Sieve replica 0: New consensus request received: XNKPQiijIp29a1KAPbM5BJoi00YSW0yqY4IJP/Z0wN9rh+UzJUz7xvDFv/RQ/OYdlIT3MDPC1GGFyMeWOvvPIg==
13:05:19.728 [consensus/obcpbft] recvRequest -> DEBU 7da Sieve primary 0 received request
```

simon
2016-04-27 11:36
why does this happen?

simon
2016-04-27 11:37
why does recvRequest get called not directly after "Sending unicast"?

simon
2016-04-27 11:55
aha

simon
2016-04-27 11:55
what a bug

simon
2016-04-27 11:56
surprisingly stuff still worked

simon
2016-04-27 11:56
really unclear how

simon
2016-04-27 14:22
tuand: i'm looking at #796 now, rebasing onto my tip

simon
2016-04-27 14:23
tuand: is #756 done?

tuand
2016-04-27 14:24
not checked in ... i have a couple of issues to clean up with praveen/angelo before i do that

simon
2016-04-27 14:24
oh

simon
2016-04-27 14:24
what's outstanding?

simon
2016-04-27 14:25
the CI for my persist branch failed, but I think it is a timing issue

tuand
2016-04-27 14:26
behave test case ?

simon
2016-04-27 14:27
yes

simon
2016-04-27 14:27
once i'm done with #796, which keeps tricking me, i can have a look at what needs to be done for #756

tuand
2016-04-27 14:28
i'm trying to find marko's note ... he mentioned a couple of issues that he wanted us to look at

simon
2016-04-27 14:32
yea

simon
2016-04-27 14:32
that's the one i pinned to the channel

simon
2016-04-27 14:33
or should i have starred it?

simon
2016-04-27 14:33
does that get shared?

tuand
2016-04-27 14:34
found it ! from April 22


simon
2016-04-27 14:34
yea, this one

simon
2016-04-27 14:34
oh 22nd?

simon
2016-04-27 14:35
i gotta take a walk, think about the lifecycle of complaints and aborted executions...

tuand
2016-04-27 14:45
#915 ... which should be closed

tuand
2016-04-27 14:46
#1180 ... i talked to the author a bit , not sure the code that randomizes the invoke is correct. Mark P. also wanted to see if we can get community help in recreating

tuand
2016-04-27 14:48
#1171 ?

simon
2016-04-27 14:52
it would be great if i could close and rename issues

simon
2016-04-27 14:52
silly to have to maintain a parallel list

simon
2016-04-27 14:53
i don't know if there is any community help interested in that

simon
2016-04-27 14:53
few pbft experts out there

simon
2016-04-27 14:54
although having the issues pinned to the side is useful

simon
2016-04-27 14:55
click channel info -> pinned items -> issue priorities

tuand
2016-04-27 14:55
i think it might be something in the test case ... i did this for playback weeks ago and it ran fine

simon
2016-04-27 14:55
it stays on the right

simon
2016-04-27 14:55
i'll have a look at the code

jyellick
2016-04-27 14:56
919, and 973 which are not on there are more or less done, pending rebase onto Simon's changeset

tuand
2016-04-27 14:56
although it's disturbing that fabric would go crazy and log garbage text

tuand
2016-04-27 14:56
919/973 agree

simon
2016-04-27 14:56
how can you work with these numbers?

simon
2016-04-27 14:57
i can never relate them to what they mean

tuand
2016-04-27 14:57
because i just looked at them :simple_smile:

simon
2016-04-27 14:57
my brain refuses to learn abstract numbers

tuand
2016-04-27 14:57
i do have to jump back to github/issues a lot

simon
2016-04-27 14:58
then that list i pinned should help

simon
2016-04-27 14:58
can just check items off

tuand
2016-04-27 14:58
btw, how did you debug travis where all the tests failed on your pr ?

simon
2016-04-27 14:58
i ran it again

simon
2016-04-27 14:58
locally

simon
2016-04-27 14:58
turned out i missed the ledger test

simon
2016-04-27 14:59
my fault

tuand
2016-04-27 14:59
oh ... 1 fail kills all of travis ?

simon
2016-04-27 14:59
yea

tuand
2016-04-27 15:00
in any case, i wish travis would keep logs around ... mentioned that to ramesh and chris a few times

simon
2016-04-27 15:01
yea

simon
2016-04-27 15:01
seems they're about to switch to jenkins

simon
2016-04-27 15:01
dunno why

bcbrock
2016-04-27 15:40
has joined #fabric-consensus-dev

jyellick
2016-04-27 22:12
@simon: Looks like my changes on top of yours pass the travis tests.


jyellick
2016-04-27 22:44
@gengjh: Sorry I just now remembered I'd meant to document my test for you, sometimes slack scrolls too fast. I used the docker-compose files in the consensus package, basically following the steps outlined in `fabric/consensus/compose-consensus-4.md`. This brought up four validators, I used the PBFT classic (so that each request would be consensed upon, and with one transaction per block). @tuand is probably a better person to ask about the details of that compose environment (I just use it according to the instructions), but I believe security is enabled. Once up, I used SOAPUI to drive login, deploy, and then drive the load for the test as described.

jyellick
2016-04-27 22:57
And, just confirmed that the PR referenced above retains the same stability with that 10k request test

tuand
2016-04-28 01:20
@gengjh: @jyellick the compose-consensus-4 docker config default to security enabled, privacy false

gengjh
2016-04-28 01:31
@jyellick: ok, cool. We have already setup a bigger env which has 10+ VPNs and cross network on separate physical machines. But haven’t tried the performance, will update here when we finish the test.

tuand
2016-04-28 01:43
@gengjh: i think @bmos299 would like to hear about your setup. Could you ping him ? or advertise what you have on # ? sorry if you've already done so and I missed it

simon
2016-04-28 13:54
@tuand: i can start looking at #756 if you want

tuand
2016-04-28 13:55
can i hold on to it ? ... i like to get through my rebase and behave issues today/tomorrow

simon
2016-04-28 13:55
sure

simon
2016-04-28 13:56
does it still write to some random file?

tuand
2016-04-28 13:56
yes, waiting until system chaincode available ... so should be work for next week

tuand
2016-04-28 13:57
i think there's a separate #issue for that too

simon
2016-04-28 13:57
we have like 5 issues that would be closed

tuand
2016-04-28 13:58
both you and @jyellick waiting for PRs ... i need to remind our committers :simple_smile:

simon
2016-04-28 13:58
yea

georglink
2016-04-28 14:01
has joined #fabric-consensus-dev

tuand
2016-04-28 14:08
@jyellick: @simon, can you ping your PR numbers to @sheehan ? he's a bit swamped so we might have to wait until next week

jyellick
2016-04-28 14:12
@simon: @tuand #1279 is mine, which contains @simon's persistence stuff, so probably fine to close that other PR (since the CI wasn't passing anyway)

jyellick
2016-04-28 14:16
(though the complaints PR is separate)

igor
2016-04-29 11:15
has joined #fabric-consensus-dev

tuand
2016-04-29 19:55
my last behave problem with #756 ... sometimes on peer startup, obcExecutor gets called too early, getCurrentInfo() can't find the genesis block, which makes queueThread() panic with a memory fault since there's no error check on the getCurrentInfo() result.

tuand
2016-04-29 19:56
now, my branch is way behind so I need to rebase based on @jyellick 's PR 1279 which probably already has the fix ?

jyellick
2016-04-29 19:57
That PR is on top of Simon's which completely removed the executor

tuand
2016-04-29 19:58
thought so :simple_smile: I need to keep up

jyellick
2016-04-29 19:58
And we no longer use the real genesis block, which is a bit of a bug, but should fix your problem

phelanm
2016-05-01 00:37
has joined #fabric-consensus-dev

phelanm
2016-05-01 00:37
hi. any hints about getting these behave tests to pass? (output to follow)

phelanm
2016-05-01 00:38

tuand
2016-05-01 02:10
interesting ! on the one failing test, it looks like we have one peer doing 7 more invokes than the other ones ...

tuand
2016-05-01 02:10
is this test failing consistently ?

tuand
2016-05-01 02:13
can you run only that test and record the peer logs ? do `behave -n "1 peer (vp3) is byzantine" -D logs=y`

tuand
2016-05-01 02:14
and if reproducible, can you create an issue and attach the logs ? I'll try to take a closer look tomorrow when I can get to my laptop

takekiyokubo
2016-05-02 13:47
has joined #fabric-consensus-dev

sheehan
2016-05-02 14:46
@tuand: I was sitting next to @phelanm and saw him reproduce the issue. It did not happen on my laptop though.

jyellick
2016-05-02 14:46
What code level is this? Does it include the latest PRs?

sheehan
2016-05-02 14:46
yeah, was the current master

sheehan
2016-05-02 14:47
I just saw the following with PR 1225. not sure if it’s related

sheehan
2016-05-02 14:47

sheehan
2016-05-02 14:47
If these are a result of not having your latest consensus PRs, just let me know. I’m working to merge those today, just want to merge earlier PRs first to avoid conflicts

jyellick
2016-05-02 14:48
I think there's a strong chance that they'll be dependent on the newer consensus PRs

jyellick
2016-05-02 14:49
Simply because so much of the code has changed

jyellick
2016-05-02 15:08
@sheehan: @phelanm Actually, with respect to one VP executing more requests than another, this is a bug I have observed (though only under stress, not in behave), which should be fixed in the pending PRs.

sheehan
2016-05-02 15:09
cool. I ran issue_680 again locally and it passed

joseph
2016-05-02 16:19
has joined #fabric-consensus-dev

simon
2016-05-02 16:27
sheehan: !! right?


jyellick
2016-05-02 16:28
I tried playing around with that at some point, but never got it to work, do you have a working example?

sheehan
2016-05-02 16:54
I didn’t write that code. In the same boat as @jyellick. I know @kostas was investigating logging at one point.

tuand
2016-05-02 17:01
I think @bcbrock wrote the logging code awhile ago? I've only done logging on a peer level so far

sheehan
2016-05-02 17:03
sometimes I just hardcode the level locally when testing :hushed: I need to stop doing that

cbf
2016-05-02 17:49
slaps @sheehan with a large trout

bcbrock
2016-05-02 17:55
My ears are burning, but I can’t figure out what is the logging issue you are discussing.

jyellick
2016-05-02 17:58
So, we'd like to be able to run the peer generally at default logging levels (say INFO or WARN or whatever), but get DEBUG messages for specific packages within peer (for instance `consensus`, or `consensus/obcpbft`)

bcbrock
2016-05-02 18:00

bcbrock
2016-05-02 18:01
So for example, CORE_LOGGING_LEVEL=warning:consensus=debug

bcbrock
2016-05-02 18:01
warning applies as the default, “consensus” goes at debug

jyellick
2016-05-02 18:02
Thanks @bcbrock I'll go give that a try now

jyellick
2016-05-02 18:28
@bcbrock: @sheehan @tuand @simon The logging is working for me now. My misunderstanding was believing that "consensus" would log for "consensus/*", but these packages need to be specified individually (not a bug, just user error)

tuand
2016-05-02 18:31
so e.g. `core_logging_level=info:consensus/controller=debug:consensus/obcpbft=debug` ?

jyellick
2016-05-02 18:35
Presently running a stress test with just `CORE_LOGGING_LEVEL=warning:consensus/obcpbft=debug`, but suspect additional `:` will process as you indicate, based on the code

sheehan
2016-05-02 18:42
@jyellick: I’m seeing consistent failures on the behave test "Scenario Outline: chaincode example02 with 4 peers and 1 membersrvc, issue #680 (State transfer) -- @1.3 Consensus Options” with PR https://github.com/hyperledger/fabric/pull/1255

sheehan
2016-05-02 18:42

sheehan
2016-05-02 18:42
any reason why those would be related?

jyellick
2016-05-02 18:45
Based on the diff there's nothing obvious to me, would need to look at logs

bcbrock
2016-05-02 19:14
“sieve” may be the smoking gun. Do other consensus algorithms pass?

jyellick
2016-05-02 19:18
1.3 is definitely Sieve. There are fixes for Sieve in the consensus PRs

bcbrock
2016-05-02 19:18
See for example my last comments in PR #1231

jyellick
2016-05-02 19:19
(And there were definitely known bugs in Sieve before those patches)

sheehan
2016-05-02 19:24
@simon: Sorry, this one needs to be rebased. https://github.com/hyperledger/fabric/pull/1265 Must have been an earlier conflicting PR

ratnakar
2016-05-02 19:52
has joined #fabric-consensus-dev

jyellick
2016-05-02 19:58
@simon: That will break #1279 as well, want me to rebase onto master, then you can rebase onto mine, or vice versa? (mine has your persistence, but not complaints)

jyellick
2016-05-02 22:13
@simon I did a rebase, which you can see here https://github.com/hyperledger/fabric/pull/1325, might be easier for you to rebase your complaints onto that

jyellick
2016-05-02 22:13
There's a rare (and especially shy, as it goes away with verbose logs) timeout that seems to be happening in the obcpbft unit tests, I'm investigating

jyellick
2016-05-03 05:30
There was a bug in the view change which caused a periodic failure in the obcpbft tests, fixed and pushed into that PR. Still seeing that behave test fail in CI ("consensus still works if 1 peer (vp3) is byzantine"), though it's run successfully 5/5 times on my vagrant env.

jyellick
2016-05-03 05:50
Was able to figure out how to retrieve the logs from CI, looks to me like it's yet another simple timeout problem in that behave test. Extended it and pushed to that PR, executing now, hopefully I will have good news in the morning.

simon
2016-05-03 09:32
how did you get the logs from CI?

jyellick
2016-05-03 12:48
There is a place you click on expand, then a number of urls, like pastebin which you can reference

simon
2016-05-03 12:49
ah, maybe that's new

simon
2016-05-03 12:50
i was trying to figure out a way to abstract this multitude of channels into a single event processor

simon
2016-05-03 12:50
i guess the easiest way would be what you use for injectchan

simon
2016-05-03 12:51
i.e. transform calls(events) with arguments into closures that get queued

jyellick
2016-05-03 12:55
Ah, hmmm, yes, the injection was a late addition, just to handle those little one off cases, but there's no reason you couldn't do it more generally

jyellick
2016-05-03 13:02
I'd like to make sure that whatever mechanism we use for the eventing doesn't require this idle channel pattern I've been using. I don't like it, but it's the most reliable and least invasive solution I've been able to come up with. Would be nice if the driving thread were owned by UT, would make determining when things are finished easier. I sort of imagine the solution is to have the driving thread enter from outside PBFT, but doing that without a lot of code duplication has not been obvious.

simon
2016-05-03 13:03
:simple_smile:

simon
2016-05-03 13:03
yes, i agree

simon
2016-05-03 13:04
i wondered about blocking vs non-blocking enqueue

simon
2016-05-03 13:04
so far it seems that you are using blocking enqueues?

simon
2016-05-03 13:04
and i think that's the right thing

jyellick
2016-05-03 13:04
Yes, using blocking queues simplified life considerably

jyellick
2016-05-03 13:05
Otherwise if a calling thread makes multiple events, you aren't guaranteed the order of arrival

jyellick
2016-05-03 13:06
(back in 15 minutes or so)

jyellick
2016-05-03 13:34
The biggest problem I have trying to wrap my head around using a single channel for events vs doing a select over multiple channels, is how to handle the timer event. The very nice thing that the new threading does is fix the view timer, such that if the pbft thread resets the view timer, when the elapsed time would have otherwise caused that timer to fire, then the pbft thread never gets that event. If there is for instance a timer service out there, which attempts to queue a timer event when the timer expires, but the event thread is off doing work, there's no way to take that event back if the timer is reset. There could obviously be some sort of callback scheme, the timer event has to call back to the timer service when an event arrives to verify that it is a legitimate event, but that seems really ugly.

simon
2016-05-03 13:47
My thought was to have an event dispatch processor that also provides timed events

simon
2016-05-03 13:47
if you cancel a timed event, it never gets delivered

simon
2016-05-03 13:48
usually this is implemented using a timer wheel

simon
2016-05-03 13:48
you just wait for the next timer to expire (and wait for other events)

jyellick
2016-05-03 13:51
So the event dispatch processor holds the thread which drives pbft execution?

simon
2016-05-03 16:28
jyellick: that "this indicates a bug" is benign

simon
2016-05-03 16:28
not sure if it will even be present with the new code

jyellick
2016-05-03 16:29
@simon: Where/what are you talking about?

tuand
2016-05-03 16:30
are you looking at #1180 ?


simon
2016-05-03 16:30
yes

tuand
2016-05-03 16:31
i'm going to walk over and talk to rick ... not sure i understand what these logs mean

simon
2016-05-03 16:31
well

simon
2016-05-03 16:31
he's testing old code, i guess

simon
2016-05-03 16:32
well, old relative to our PRs

jyellick
2016-05-03 16:32
Yes, there was a bug... which was eventually fixed, but, the bulk of that code disappeared in Simon's PR

jyellick
2016-05-03 16:32
Sometimes that message was benign, other times it truly indicated a bug; at some point, I thought I switched that message.

tuand
2016-05-03 16:33
the latest logs says peer is dying on startup with connection errors ... i don't think the network is configured right

simon
2016-05-03 16:33
which peer?

tuand
2016-05-03 16:34
vp0/vp1

tuand
2016-05-03 16:35
ah wait ... vp0 is culprit

tuand
2016-05-03 16:37
still think config is wrong ... i'll go talk to him

simon
2016-05-03 16:39
also he really doesn't have to test any consensus stuff until the new PRs are in

simon
2016-05-03 16:39
well, jason's

tuand
2016-05-03 16:44
all at lunch :simple_smile: will check again later in the afternoon

muralisr
2016-05-04 01:37

muralisr
2016-05-04 01:37
that was with "CORE_SECURITY_ENABLED=true CORE_SECURITY_PRIVACY=true ./peer peer --peer-chaincodedev"

muralisr
2016-05-04 01:39
I can try setting “statetransfer.blocksperrequest” to be non zero but I’d expect not to have to set anything

jyellick
2016-05-04 02:00
@muralisr: There is a default defined in the `consensus/obcpbft/config.yaml`, I know there's been some effort to make the config not be accessed relative to the current working directory, but I don't believe that's in yet. This is only the first symptom, you'll likely see many other pbft problems for lack of config if that is the problem.

muralisr
2016-05-04 02:00
@jyellick: let me make sure I’m not doing something wrong first

muralisr
2016-05-04 02:01
I thought I had reverted my work

muralisr
2016-05-04 02:01
but let me make sure

jyellick
2016-05-04 02:01
I may have misquoted you the path... hold on

jyellick
2016-05-04 02:04
Actually, looks like @simon moved the state transfer config out of pbft and into `core.yaml`. Looks like it is there in mine.

muralisr
2016-05-04 02:07
false alarm… sorry about that @jyellick

jyellick
2016-05-04 02:08
No problem, the best sort of problems are the ones that don't exist! Glad to hear it.

muralisr
2016-05-04 02:08
:simple_smile:

muralisr
2016-05-04 02:09
I did have some of my changes

nits7sid
2016-05-04 09:58
Is the leader elected in sieve model of consensus same as the primary in the classic mode?

simon
2016-05-04 10:05
yes

simon
2016-05-04 10:06
sieve uses the pbft primary as leader

simon
2016-05-04 10:11
nits7sid: what are you specifically interested in?

nits7sid
2016-05-04 10:14
i am actually interested in the difference between the working of sieve and classic pbft. In sieve i read that the replicas execute the operations and send the result hashes or signatures to the leader, and finally the leader commits the transaction if it receives f+1 identical hashes. so does that mean only the leader can commit the transaction?

simon
2016-05-04 10:15
the leader just provides a lifecycle management

simon
2016-05-04 10:15
there is always just one request (or block of requests) outstanding

simon
2016-05-04 10:18
(1) leader sends EXECUTE, (2) all replicas tentatively execute, but do not commit, (3) replicas send signed result to the leader, (4) leader waits until it collects enough results, (5) leader sends this result set through PBFT, (6) all replicas receive the result set and can commit (deterministic), rollback (non-deterministic), or sync (classified as deterministic, but local replica had a non-deterministic result)

nits7sid
2016-05-04 10:20
ohh.. so the leader doesn't actually filter out the non-deterministic transactions from the result set?

simon
2016-05-04 10:21
no, all replicas do this based on the result set

nits7sid
2016-05-04 10:22
In point (5), does "through PBFT" mean the primary-backup mechanism is carried out again? or is it just that the replicas now only check the result set and filter accordingly?

simon
2016-05-04 10:23
what mechanism?

simon
2016-05-04 10:23
the sieve leader submits a new request into pbft, acting as the pbft primary: it sends a pre-prepare of the result-set "request"

nits7sid
2016-05-04 10:25
ohh okay.. i get it

nits7sid
2016-05-04 10:25
thanks @simon

simon
2016-05-04 10:41
jyellick: i don't understand how the idlechan works

simon
2016-05-04 10:41
it doesn't seem to account for outstanding timers

risto.laanoja
2016-05-04 12:22
has joined #fabric-consensus-dev

weizhao
2016-05-04 14:16
has joined #fabric-consensus-dev

jyellick
2016-05-04 14:17
@simon: You're correct, the idlechan does not account for outstanding timers, those are checked first

sachikoy
2016-05-04 14:18
has joined #fabric-consensus-dev

vita
2016-05-04 14:20
has joined #fabric-consensus-dev

simon
2016-05-04 14:33
can we drop the fuzzing tests?

simon
2016-05-04 14:33
i don't think they're useful anymore

jyellick
2016-05-04 14:36
I hate dropping tests, though usually they seem to reveal bugs in the mock network moreso than bugs in the code.

simon
2016-05-04 14:40
:simple_smile:

hill
2016-05-04 14:41
has joined #fabric-consensus-dev

simon
2016-05-04 14:47
TestViewChangeUpdateSeqNo is failing when i run it in my complaints branch, and in a complete run

simon
2016-05-04 14:49
some timeout issue...

simon
2016-05-04 15:05
whole day spent on rebasing again :simple_smile:

muralisr
2016-05-04 15:06
I know the feeling :simple_smile: did the same yesterday

simon
2016-05-04 15:06
rebase actually forces you to write larger patches

simon
2016-05-04 15:06
which is not good

simon
2016-05-04 15:06
well, commits

muralisr
2016-05-04 15:06
ah right

simon
2016-05-04 15:06
or you get the same collision in every commit :confused:

muralisr
2016-05-04 15:06
and “just for the sake of rebase"

simon
2016-05-04 15:07
right

simon
2016-05-04 15:07
now i squished all my commits into one

simon
2016-05-04 15:07
it is a better solution in the end

muralisr
2016-05-04 18:08
… I need some help with the right way to implement the “happy path” of #588

muralisr
2016-05-04 18:09
most of the changes are in the consensus component

muralisr
2016-05-04 18:09
here’s the change I need to implement


muralisr
2016-05-04 18:09
there are two key things

muralisr
2016-05-04 18:10
1) need to account for transaction errors (which we have been ignoring)

muralisr
2016-05-04 18:10
2) need to send out an event on block failure too (we have been only sending event from CommitTx…)

muralisr
2016-05-04 18:11
First question… where would I store the txerrs from ExecuteTransactions in helper.ExecTxs

muralisr
2016-05-04 18:12
I can define a curBatchErrs field in Helper to mimic curBatch

muralisr
2016-05-04 18:13
would that be ok ?

jyellick
2016-05-04 18:29
@muralisr: I've often wondered if `helper.go` is really the place for this work to live, ultimately, it seems odd to me that it is the consensus package which is doing this accounting, but in the interest of expediency, since the `curBatch` stuff is already there, it seems like a good place to put it to me.

muralisr
2016-05-04 18:29
thanks, @jyellick … those were exactly my thoughts too

muralisr
2016-05-04 18:30
now I know who to send this for review :simple_smile:

muralisr
2016-05-04 18:30
thanks!

jyellick
2016-05-04 18:30
No problem

muralisr
2016-05-04 18:32
at least they have the same codepaths/semantics … I’ll try hard to maintain that

muralisr
2016-05-04 22:45
The “consensus failure” part above is only useful if we have the transactions (ie, curBatch) at that point. For noops, this is true. Is it true for other consensus ?

muralisr
2016-05-04 22:47
I’m wondering if 'send “blockfailure” event ‘ should be changed to ‘send “blockfailure” event if we have the tx list'

muralisr
2016-05-04 22:48
or more safely, send only block success events (ie, on CommitTxBatch)

jyellick
2016-05-05 15:00
@muralisr: Having a little trouble making sense of this. What does 'blockfailure' mean?

muralisr
2016-05-05 15:05
@jyellick : basically opposite of commit

muralisr
2016-05-05 15:05
if you look at the picture above, the “consensus failure” path

jyellick
2016-05-05 15:07
What about the scenario where it's rolled back on some nodes, but committed on others?

muralisr
2016-05-05 15:08
right, that and also other complex cases where we may not have all the information to send out a meaningful “event” …

jyellick
2016-05-05 15:08
[This is basically only a Sieve case, where consensus is achieved, but instead of commit, the block comes from statetransfer]

muralisr
2016-05-05 15:08
I’m thinking the best we can do is the “consensus success” path

muralisr
2016-05-05 15:10
it appears simple and better to do something that we can guarantee (and just an extension of what we do already)

muralisr
2016-05-05 15:10
what do you think ?

jyellick
2016-05-05 15:10
(Sorry, juggling scrum too)

jyellick
2016-05-05 15:21
So, I think we can definitely handle the 'consensus success' path, as you say. If a peer calls into commit, that should indicate that consensus has occurred for that particular block.

tuand
2016-05-05 15:22
so every peer will send an event ... is it ok that the listener will receive multiple events about the same tx ?

jyellick
2016-05-05 15:23
The only thing that consensus can more generally give you is an agreement on the hash of the latest block. This happens at every block for Sieve, and every `K` blocks for the other PBFT variants. This is the only insight consensus has into the 'execution output' (unless you're talking about Sieve, but it is a special case, and its long-term viability is in question)

jyellick
2016-05-05 15:26
I'll also say, that except for the black sheep that is Sieve, the other consensus plugins all call Begin Exec Commit in essentially an atomic fashion. The output from Exec is possibly logged, but otherwise ignored (last we looked at it, it wasn't particularly useful, I think it was the state hash as output), and it's not at all clear what consensus should do if there is an error invoking exec.

tuand
2016-05-05 15:26
@simon how are things with your rebase ? also PR #1277 ?

muralisr
2016-05-05 15:27
@tuand : it will be ideal if listener receives one event - block success/block failure

muralisr
2016-05-05 15:28
I was hoping ledger.CommitTx and ledger.RollbackTx will catch those nicely

jyellick
2016-05-05 15:28
But for a byzantine consensus protocol, you cannot rely on any one particular replica following the protocol

muralisr
2016-05-05 15:28
ok

jyellick
2016-05-05 15:29
In the case that you tolerate `f` byzantine nodes, you would need, at a bare minimum, `f+1` attestations that a block was committed, or not

jyellick
2016-05-05 15:30
You could on the other hand flip this a bit, and say that a VP is trusted by the NVPs it is talking to

jyellick
2016-05-05 15:30
And say that each VP only broadcasts the event to its connected NVPs

jyellick
2016-05-05 15:32
Sorry if I'm playing catch up, what is the relationship between the listener, and the VP?

jyellick
2016-05-05 15:35
[And to jump back for a second, I'm still not sure what a block failure would be, outside of Sieve. If a request is proposed for a particular block, and the network ends up not deciding to do that block, it's not like that request is gone, in general it will simply find its way into a later block]

sheehan
2016-05-05 17:36
if I want to run pbft with a single peer, are there any special configurations I need to set?

jyellick
2016-05-05 17:46
@sheehan: @simon: Is the expert here, but I think setting `N` to 1, and `f` to 0 might do it

posnerj
2016-05-06 00:21
has joined #fabric-consensus-dev

nits7sid
2016-05-06 07:01
Hello.. in the obcpbft/config file the timeout value for singleblock=2s. So does that mean a block will be created every 2s?

simon
2016-05-06 10:09
what does "consensus failure" mean?

simon
2016-05-06 10:13
nits7sid: if you run batch, then a new block will be created after 2s, even if the block is not full

nits7sid
2016-05-06 10:16
ohh.. i was testing pbft and noops. in noops when i call an Invoke chaincode function 100 times, all 100 transactions go into one block. But in case of pbft it creates 100 separate blocks. why does this happen? or am i going wrong in my configuration?

simon
2016-05-06 10:23
you need to use pbft batch

muralisr
2016-05-06 11:58
@jyellick: @simon would you have some time to review https://github.com/muralisrini/fabric/tree/report_transaction_and_block_failures_take2%23issue588 for changes to consensus for issue 588 please?

simon
2016-05-06 12:01
i will

muralisr
2016-05-06 12:03
thanks!

muralisr
2016-05-06 12:03
it only takes care of the “consensus succeeded” path. For now we don’t issue an event on “consensus failure"

muralisr
2016-05-06 12:05
the latter needs more work just to determine all the cases, so we can do it correctly (if it can be done at all)

simon
2016-05-06 12:07
feels wrong that the result is non-hash (i.e. non-consensus) data

muralisr
2016-05-06 12:10
it is

muralisr
2016-05-06 12:11
there is some explanation in the issue why we do it this way

muralisr
2016-05-06 12:11
and how we can change it in future to do the right thing

garisingh
2016-05-06 13:01
has joined #fabric-consensus-dev

michaelhaley
2016-05-06 13:44
has joined #fabric-consensus-dev

simon
2016-05-06 14:41
tuand: do we have #754 in now?

tuand
2016-05-06 14:42
no ... haven't finished rebasing

simon
2016-05-06 14:42
ok

tuand
2016-05-06 14:43
btw, you mentioned squishing all your commits together ? how did you do that ?

simon
2016-05-06 14:43
i did a rebase -i and marked commits as fixup

tuand
2016-05-06 14:44
that's what i'm doing ... thought you mentioned something about reducing the # of commits so the committers don't complain so much

simon
2016-05-06 14:46
well, it is a balance - too many small commits (especially those fixing previous errors) - difficult to review; too big of a commit - difficult to review as well

tuand
2016-05-06 14:46
well 90% of my commits are -m "resolve rebase xxxxx"

tuand
2016-05-06 14:47
ok, if i'm not done monday, please kill me

simon
2016-05-06 14:47
resolve rebase?

simon
2016-05-06 14:47
don't you fix up the commit itself?

tuand
2016-05-06 14:48
rebase -i ... hit a rebase conflict, resolve conflict through a commit

simon
2016-05-06 14:48
oh

tuand
2016-05-06 14:48
then rinse and repeat

simon
2016-05-06 14:57
is jyellick around today?

jyellick
2016-05-06 14:57
I'm here

simon
2016-05-06 14:57
hi!

simon
2016-05-06 14:58
do you think now would be a good time to move state transfer from consensus into core?

simon
2016-05-06 14:58
that would allow us to slim down the consensus API significantly

jyellick
2016-05-06 14:58
This could be a good time to do it, after your persistence changes eliminated the direct callins

simon
2016-05-06 14:59
right

simon
2016-05-06 14:59
i snuck that in :simple_smile:

jyellick
2016-05-06 15:00
Working with bcbrock yesterday I found a state transfer bug of some sort, haven't figured out which side the failure's on, but was going to be mucking with that code anyway, I can go ahead and move it out. What package would you propose? I would think `peer/statetransfer`?

jyellick
2016-05-06 15:00
(Was also looking at 1368 and am very confused)

jyellick
2016-05-06 15:03
(Actually, found it)

simon
2016-05-06 15:04
i think sheehan and jeff may have an opinion on where it should go

simon
2016-05-06 15:04
i'd guess core/statetransfer

simon
2016-05-06 15:05
so how does 1368 happen?

jyellick
2016-05-06 15:05
Just added it to the issue text, but basically, when rebasing onto your stuff, I missed one of the instances where the Sieve thread started executing in `pbft-core.go`

jyellick
2016-05-06 15:06
So there were concurrent modifications to the checkpoint map, so it changed size between counting the entries and filling the array

simon
2016-05-06 15:06
oh, is this the same as 1366?

jyellick
2016-05-06 15:06
Good chance

simon
2016-05-06 15:06
execDoneSync() calls?

jyellick
2016-05-06 15:06
Yep

simon
2016-05-06 15:06
ha

simon
2016-05-06 15:06
yes

jyellick
2016-05-06 15:07
They should be able to just be `execDone` now I believe


jyellick
2016-05-06 15:08
Perfect, thanks

tuand
2016-05-06 15:10
@bcbrock: was also mentioning #1091 and #919 yesterday ? I haven't seen any updates on these though

jyellick
2016-05-06 15:15
I worked with @bcbrock on 1091 yesterday, which revealed the statetransfer bug I referred to above, but also exposed that the test was perhaps being too stringent on the behavior of PBFT. The key point being that even with 4 non-byzantine nodes, the only guarantee is that 2f+1=3 of them will be participating in the network, and that the network will be making progress. It is okay, and largely expected, from a protocol perspective that 1 of them is behind and not participating. This is something I've discussed a little with @vukolic and something we'll be looking to improve upon in the future, but it's not a bug; changing this behavior would be an enhancement.

jyellick
2016-05-06 15:16
With respect to 919, that has been closed. I think maybe you're thinking of 915, which I think may be fixed, but will be tough to verify until Simon's PR 1384 goes through.

jyellick
2016-05-06 15:17
@sheehan: @jeffgarratt We are thinking this sprint would be an opportune time to move statetransfer out of consensus and into some other package, where would you propose we move it in the package structure? Since it is dependent on the peer network, my proposal would be `core/peer/statetransfer` but I'm open to other places.

tuand
2016-05-06 15:17
yup, typo #915

jeffgarratt
2016-05-06 15:18
@jyellick: @sheehan makes sense to me.

jeffgarratt
2016-05-06 15:19
@jyellick: please feel free to contact me any time if you want to discuss the task

jyellick
2016-05-06 15:19
@jeffgarratt: @simon Suggested `core/statetransfer`, maybe he can explain his reasoning

jeffgarratt
2016-05-06 15:20
I should free up a bit early next week

jyellick
2016-05-06 15:21
Thanks, I likely will confer, as it will need to stop using the helper calls, and go directly into peer. My guess is that it should be a fairly simple migration, but will double check to make sure it's being hooked in correctly.

jeffgarratt
2016-05-06 15:21
sounds good

simon
2016-05-06 15:26
jyellick: i suggest still using interfaces

simon
2016-05-06 15:26
jyellick: but then you can use ledger as something that provides this interface

jyellick
2016-05-06 15:27
@simon: Yes, I would still try to use the MessageHandlerCoordinator interface, which `helper.go` wraps, so the method names would change slightly. It is getting the reference to it properly that I might need Jeff's help with.

jyellick
2016-05-06 15:28
I simply meant that statetransfer should no longer need to use the `consensus.Stack` interface

simon
2016-05-06 15:28
right, that's my idea

simon
2016-05-06 15:28
the stack interface could be reduced considerably

jyellick
2016-05-06 15:28
Exactly, which would be great

simon
2016-05-06 15:29
but given that most interfaces that statetransfer uses are basically passed through by the helper, you could use the ledger itself (or a mock)

simon
2016-05-06 15:29
of course not for all

simon
2016-05-06 15:30
and it won't be a single "object" providing all these interfaces

simon
2016-05-06 15:30
there will be ledger, communication, etc.

jyellick
2016-05-06 15:32
Yes


simon
2016-05-06 15:34
any idea what is happening here?

tuand
2016-05-06 15:37
is this #1358 ?

jyellick
2016-05-06 15:37
--- FAIL: TestBatchCustody (6.01s) obc-batch_test.go:105: Expected replica 0 to have one block

jyellick
2016-05-06 15:39
(I really still want to know how this isn't a huge bug in `go test`. When a test fails, you should get the output of the failing test, and that's it. Instead, you get just tons of random output from other tests in addition to the failing test, and it makes it obnoxious to figure out what actually failed. Why I always end up running with `go test -v`)

simon
2016-05-06 15:46
some timing issue

tuand
2016-05-06 15:49
just saw #1379 `**23:58:39.168 [consensus/util] RegisterChannel -> WARN 055 Received duplicate connection from <nil>, switching to new connection**`

tuand
2016-05-06 15:50
we can use <nil> as a map key ?

jyellick
2016-05-06 15:52
Yes, I've noticed that too, need to investigate, I did not believe that `handler.peerHandler.To()` could be `nil`

jyellick
2016-05-06 15:53
My money is on handlers being unnecessarily instantiated. It bears investigation, it just hasn't obviously broken anything.

suma
2016-05-06 17:22
has joined #fabric-consensus-dev

juanblanco
2016-05-08 17:35
has joined #fabric-consensus-dev

pablofullana
2016-05-08 22:09
has joined #fabric-consensus-dev

shubhamvrkr
2016-05-09 06:26
has joined #fabric-consensus-dev

simon
2016-05-09 09:54
tuand: i don't understand your "resolve rebase" commits

simon
2016-05-09 11:08
i'm trying to add "bound number of requests"

simon
2016-05-09 11:09
i'm trying to figure out what the conditions are to reject a request because too many are outstanding from that replica

simon
2016-05-09 11:10
the problem is that i may not have processed the oldest request, while the primary already did

simon
2016-05-09 11:11
@vukolic: any suggestions?

simon
2016-05-09 11:11
async networks are difficult

simon
2016-05-09 11:14
maybe we don't have to bound that, because the watermarks bound it

vukolic
2016-05-09 12:10
bound number of requests should in principle be per client

vukolic
2016-05-09 12:10
not sure exactly what the count of reqs from a replica means - can you pls elaborate?

vukolic
2016-05-09 12:25
@simon:

simon
2016-05-09 12:32
@vukolic: sorry, was afk for a while

simon
2016-05-09 12:33
@vukolic: i wanted to bound the number of requests a replica can inject at any time

simon
2016-05-09 12:33
@vukolic: not per client

simon
2016-05-09 12:33
clients are potentially millions?

vukolic
2016-05-09 12:35
ok so the problem is consistency of the counters at primary vs other replicas?

simon
2016-05-09 12:36
yea

simon
2016-05-09 12:36
the primary may already have executed a request and now allow a new request from a replica

simon
2016-05-09 12:37
while some other replica didn't - and then it would consider that primary malicious

simon
2016-05-09 12:37
so obviously we need something like watermarks

vukolic
2016-05-09 12:38
what's the goal of this - preventing a DoS?

simon
2016-05-09 12:47
yes


vukolic
2016-05-09 12:50
the issue happens only because of PBFT watermarks

vukolic
2016-05-09 12:50
if the high watermark would be low watermark + 1

vukolic
2016-05-09 12:50
i.e., ordering and execution in sync - it would not happen right?

vukolic
2016-05-09 12:52
to circumvent this - one could bound the number of requests from last stable checkpoint

vukolic
2016-05-09 12:53
and have a single number for this

vukolic
2016-05-09 12:53
as in, each replica can have at most that many requests in between checkpoints

vukolic
2016-05-09 12:54
this can be as well specified as reqlimit = reqfactor*checkpointsize

vukolic
2016-05-09 12:54
would that address the concern?

simon
2016-05-09 12:57
aha

simon
2016-05-09 12:57
i'll have to think about this

simon
2016-05-09 12:57
maybe watermarks are enough anyways

simon
2016-05-09 13:49
tuand: do you want some review for 756?

simon
2016-05-09 13:50
tuand: i guess best would be to squash all these fixup commits into the main change

tuand
2016-05-09 13:50
i do but want to do this after i fix up the unit tests

simon
2016-05-09 13:50
i don't understand how these resolve rebase commits even appear in the first place

tuand
2016-05-09 13:51
i can show you in 30 min or so ... about to go in a call now

simon
2016-05-09 13:51
okay

simon
2016-05-09 13:52
so many calls

tuand
2016-05-09 13:53

simon
2016-05-09 14:03
i don't like these loops waiting for an ID

simon
2016-05-09 14:03
that's got to work better than that

vukolic
2016-05-09 14:06
@simon - watermarks may introduce another point of instability - and not clear exactly how this would be done

simon
2016-05-09 14:06
yea, i mean the watermarks that pbft already uses

simon
2016-05-09 14:07
they already pace the number of pbft requests that can be in flight at any given time

vukolic
2016-05-09 14:07
as we count from checkpoint to checkpoint in a static way - with the above we would have what I see as a simpler solution

vukolic
2016-05-09 14:07
indeed they do - but I see this as a root of an issue and you see it as a solution :simple_smile:

simon
2016-05-09 14:08
i don't know if counting between checkpoints helps either

simon
2016-05-09 14:08
example:

simon
2016-05-09 14:09
i didn't execute to a point yet that will produce a checkpoint

simon
2016-05-09 14:09
i get a request from the primary

vukolic
2016-05-09 14:09
the idea goes like this

simon
2016-05-09 14:09
well, pre-prepare, beyond the checkpoint

vukolic
2016-05-09 14:10
you execute batches from 1 to CHK and then from CHK+1 to 2*CHK

vukolic
2016-05-09 14:10
it does not matter when you execute

vukolic
2016-05-09 14:10
you just allow a param*CHK requests from a given replica in between any two checkpoints

vukolic
2016-05-09 14:10
param could be as well <1

vukolic
2016-05-09 14:11
it is certainly <= batchsize :simple_smile:

simon
2016-05-09 14:12
hmm

simon
2016-05-09 14:12
well we have to allow at least CHK requests

vukolic
2016-05-09 14:12
why?

simon
2016-05-09 14:12
so that if only one replica is injecting requests, and none of the other replicas want to inject requests

simon
2016-05-09 14:12
that the network can progress

vukolic
2016-05-09 14:13
aha

vukolic
2016-05-09 14:13
good catch

simon
2016-05-09 14:13
also:

simon
2016-05-09 14:13
how does this interact with batching

vukolic
2016-05-09 14:13
but with that then the approach does not work at all, since you need param = batchsize

simon
2016-05-09 14:13
maybe we shouldn't care about this at all?

simon
2016-05-09 14:13
i mean the byzantine part

simon
2016-05-09 14:13
and only care about our own performance

simon
2016-05-09 14:14
after all, the goal is to locally supply information so that the frontend can reject new requests

simon
2016-05-09 14:14
what we need is a way to pace ourselves, without overloading the primary

simon
2016-05-09 14:15
because if the primary (assembling batches) is overloaded, then we start complaining

simon
2016-05-09 14:15
we do a view change, but nothing will change

simon
2016-05-09 14:17
tuand: i don't like that writing of the whitelist to a random file at all

simon
2016-05-09 14:17
not the right way of doing things

tuand
2016-05-09 14:55
yes, that writing to a file was a stopgap while we wait for system chaincode

tuand
2016-05-09 14:55
i think #830 but i need to check

muralisr
2016-05-09 14:56
so @simon, the whitelist is primarily what we need the sys CC for ?

simon
2016-05-09 15:53
muralisr: and other changes in the consensus configuration

simon
2016-05-09 15:53
i.e. if you figure out that performance is unsatisfactory, and you want to increase the batch size, etc.

muralisr
2016-05-09 16:12
I see

muralisr
2016-05-09 16:13
so some of it _can_ be dynamically adjusted

simon
2016-05-09 16:27
all of it, basically

jeroiraz
2016-05-09 19:37
has joined #fabric-consensus-dev

nick
2016-05-10 02:32
has joined #fabric-consensus-dev

tuand
2016-05-10 14:28
@tuand uploaded a file: https://hyperledgerproject.slack.com/files/tuand/F17KLN3DY/view-change.zip and commented: now why would the peers decide to view-change so quickly and so frequently ? Also, in peer4.log, connections from <nil> ? and at end of peer4.log, note the attempts at state transfer ?

tuand
2016-05-10 14:28
I've asked them to turn on debug logging level

jyellick
2016-05-10 14:29
So, state transfer is horribly broken at the moment

jyellick
2016-05-10 14:30
And, assuming it gets triggered a few times it will certainly put the network into a weird state where it constantly changes views,

jyellick
2016-05-10 14:31
Most of the logic was lost when the executor was removed, and the tests became largely invalid

jyellick
2016-05-10 14:31
I'll try to submit something to patch it up later today, but still wrestling with how to fix it

tuand
2016-05-10 14:31
i think this is from a commit late friday

jyellick
2016-05-10 14:32
That would have contained the broken state transfer, I think

tuand
2016-05-10 14:33
first thing I'm scratching my head about is the view changes seem to happen right after peer startup ... will have to see if debug logs show anything right after startup

tuand
2016-05-10 14:35
and of course , all behave tests ran fine for me friday till today and i run this app fine on friday

simon
2016-05-10 14:54
oh i broke it?

simon
2016-05-10 14:54
i didn't realize

jyellick
2016-05-10 14:57
Yes, `Initiate` is called only during construction of the helper, so once statetransfer completes, it will never execute again.

jyellick
2016-05-10 14:57
Similarly, the mock tests were changed to bypass the statetransfer code entirely, and used generated blocks, rather than copying them from other peers

jyellick
2016-05-10 14:59
I see what your assumptions were now (they've actually made splitting the statetransfer tests out much easier) and I'm working to make them true, but undecided how to handle everything yet.

simon
2016-05-10 15:00
oh!

simon
2016-05-10 15:00
i didn't realize that this was the usage pattern

simon
2016-05-10 15:00
i've been terrible at not commenting interfaces - we need to fix this

simon
2016-05-10 15:01
yea i wanted to figure out how to copy blocks from other peers - but I didn't know how

jyellick
2016-05-10 15:01
We seem to be enforcing the go vet / golint on PRs now, so I think they will get cleaned up

simon
2016-05-10 15:01
i think overall, it may be a good idea to limit our tests as much as possible

simon
2016-05-10 15:01
i.e. if it can be done with a single peer, do it with a single peer

jyellick
2016-05-10 15:01

simon
2016-05-10 15:01
instead of creating a network

jyellick
2016-05-10 15:02
It's hard to see in the diff, really, it should be compared against https://github.com/hyperledger/fabric/pull/1416

jyellick
2016-05-10 15:02
But basically, this rips out the vast majority of the mock ledger, moves it to state transfer, and then uses a much simpler mock ledger for pbft, I think it's an improvement.

jyellick
2016-05-10 15:02
(It's also necessary in order to get the statetransfer tests out of obcpbft)

simon
2016-05-10 15:02
i saw your other patch - i was a bit confused that sheehan wanted documentation for lines you didn't even touch

simon
2016-05-10 15:03
great

jyellick
2016-05-10 15:03
Yeah, I added them, I guess they were visible in the diff, so, they became my responsibility

simon
2016-05-10 15:03
let me know if there is anything i can do - otherwise i'll just create documentation commits and go vet stuff

simon
2016-05-10 15:04
maybe if we don't combine the "stack" into one interface, but pass many separate interfaces, we can also simplify each of those interfaces

simon
2016-05-10 15:05
ideally we'd get rid of the helper and just hand in the ledger, for example

jyellick
2016-05-10 15:05
So maybe I can get your input on how to fix the `helper.go` statetransfer usage. The usage pattern for statetransfer was designed as follows:
1. Detect that state transfer is required
2. Call `Initiate`
3. Call `AddTarget`, feeding in potential hashes as they become available
4. Receive a callback indicating `Finished`

simon
2016-05-10 15:05
let me have a look

simon
2016-05-10 15:05
oh you have to call finished?

jyellick
2016-05-10 15:06
That's a callback

jyellick
2016-05-10 15:06
The reason why the usage is a little messy, is because the underlying ledger infrastructure gives us no guarantee that a particular target is reachable

simon
2016-05-10 15:06
yea

jyellick
2016-05-10 15:06
Otherwise it would simplify things immensely

simon
2016-05-10 15:06
what is the relation between `Completed` and `Finished`?

jyellick
2016-05-10 15:07
Oops, I misspoke, going from memory

simon
2016-05-10 15:07
ah the same

simon
2016-05-10 15:07
okay

jyellick
2016-05-10 15:07
You can see the interface doc:
```
// Listener is an interface which allows for other modules to register to receive events about the progress of state transfer
type Listener interface {
	Initiated()                                                   // Called when the state transfer thread starts a new state transfer
	Errored(uint64, []byte, []*protos.PeerID, interface{}, error) // Called when an error is encountered during state transfer, only the error is guaranteed to be set, other fields will be set on a best effort basis
	Completed(uint64, []byte, []*protos.PeerID, interface{})      // Called when the state transfer is completed
}
```

simon
2016-05-10 15:07
can we just call `Initiated` again from `Completed`?

jyellick
2016-05-10 15:07
So, the problem is the race, and why `AddTarget` does not implicitly call `Initiate`

sheehan
2016-05-10 15:08
@simon: @jyellick who should I track down for documenting public functions in consensus?

simon
2016-05-10 15:08
sheehan: oh for sure it is us

simon
2016-05-10 15:09
sheehan: just that jyellick's patch didn't even touch these interfaces :simple_smile:

simon
2016-05-10 15:09
i'll be working on it

sheehan
2016-05-10 15:09
yeah, I realize. Thought it would be easy to try to fix up as we go

simon
2016-05-10 15:09
but we're trying to severely cut down the interfaces exposed

sheehan
2016-05-10 15:09
but as long as someone is working on it, that’s fine

simon
2016-05-10 15:10
jyellick: what race?

simon
2016-05-10 15:10
jyellick: maybe we need to move the state transfer interaction into the plugin statemachine

simon
2016-05-10 15:10
so that no races will happen

jyellick
2016-05-10 15:12
Essentially, imagine the following:
t1: Calls Initiate
t1: Adds target
t2: Calls Completed
t1: Adds target (implicitly calls Initiate)
t1: Receives Completed
t1 now believes that state transfer is not occurring

simon
2016-05-10 15:12
hmm

simon
2016-05-10 15:12
yea, that race will always exist

simon
2016-05-10 15:13
well it gets back its interface

simon
2016-05-10 15:13
so it can tell that the completed refers to a different stage

simon
2016-05-10 15:13
the question is, where do we put the interlock and retry

simon
2016-05-10 15:14
it has to be in t1

jyellick
2016-05-10 15:16
It can tell the completed came from a different target, but that's also okay. This would also be a fine scenario:
t1: Calls Initiate
t1: Adds target A
t1: Adds target B
t2: Calls Completed (to target A)
t1: Receives Completed

jyellick
2016-05-10 15:17
The guarantee made by state transfer is that one of the targets will be reached, not the last one added

simon
2016-05-10 15:17
okay

simon
2016-05-10 15:19
so ideally i'd prefer the helper not to do any mediation in the state transfer, and have a state transfer object passed to the plugin

simon
2016-05-10 15:20
and for tests we replace this with something simple (as we have anyways right now)

simon
2016-05-10 15:20
and then the plugin needs to figure out whether to initiate + addtarget again

jyellick
2016-05-10 15:27
So the executor eliminated the race by having the initiating thread block until completion. I think that might be the most direct way to fix this. Simply have a goroutine whose responsibility it is to interact with state transfer: when that goroutine is busy (via, say, a default on a select), the pbft thread would be invoking `AddTarget`; when the thread isn't busy, it would grab the target and invoke `Initiate`, then `AddTarget`.

jyellick
2016-05-10 15:28
Eh, maybe not, not sure that fixes this.

jyellick
2016-05-10 15:30
Double calling `Initiate` is a problem, but could even be solved with a mutex, the problem is knowing whether or not statetransfer is executing in PBFT. Maybe the key would be to listen on the `Initiated` callback to keep track.

jyellick
2016-05-10 15:32
@simon: What would you think of eliminating `Initiate()` as a call, having `AddTarget` implicitly initiate, then require the caller to listen for the `Initiated` and `Completed` events to figure out what's going on?

simon
2016-05-10 15:33
hmmm

simon
2016-05-10 15:33
initiated is not used at all

jyellick
2016-05-10 15:33
Right, I added it because it seemed like valuable information, but never had the need to consume it.

simon
2016-05-10 15:34
so the race is really:

simon
2016-05-10 15:34
addtarget, completed sent, addtarget, completed arrives. actually we should keep waiting

simon
2016-05-10 15:34
that has to be handled in pbft

simon
2016-05-10 15:34
in the statemachine

jyellick
2016-05-10 15:35
Yes, assuming addtarget now implicitly calls initiate

simon
2016-05-10 15:35
right

simon
2016-05-10 15:35
i think that makes sense to do

jyellick
2016-05-10 15:35
Okay, I like it, I'll get to work on coding it up

simon
2016-05-10 15:35
cool!

simon
2016-05-10 15:35
let me know what i can do

jyellick
2016-05-10 15:35
Will do, once I've got the code together, I'll post it here for review

jyellick
2016-05-11 15:12
@simon: https://github.com/hyperledger/fabric/pull/1445 Here's the reworking of the statetransfer logic

simon
2016-05-11 15:14
great

jianzhang98
2016-05-11 16:21
has joined #fabric-consensus-dev

tuand
2016-05-12 00:33
so #756 ... how it's done right now is that we put the PeerIDs in a list, sort by PeerID.name and use the index of the sorted list as the replica id

tuand
2016-05-12 00:34
we can't create the consenter until we have a list of the N validating peers since we won't have the correct id until the list is complete

tuand
2016-05-12 00:36
so, when we do `newObcXXXX`, we wait for the list to be completed before we get our own id and finish up `newPbftCore` etc ...

tuand
2016-05-12 00:37
i do this by using a channel, `newObcXXX` waits on the channel, over in Peer, when the list is complete, i put a bool on the channel

tuand
2016-05-12 00:40
the question now is, how do i make the engine wait until `obcXXXX` is ready to receiveMsg ?

tuand
2016-05-12 00:41
i can wait on another channel but that means it's one more method added to consenter interface ?

jyellick
2016-05-12 00:49
@tuand: Does the value of the `bool` matter? In general, if you are using channels to communicate an event that has no value, I think the accepted go practice is to use a `chan struct{}`, and then simply write to it with a `ch<-struct{}{}`.

tuand
2016-05-12 00:50
the value doesn't matter ... it's just to wake up the other end

jyellick
2016-05-12 00:50
I'd vote for the `struct{}{}` approach then, still trying to process the rest of your message

jyellick
2016-05-12 00:51
So, I think it's also important that you remember we need to support the `N=1` case

jyellick
2016-05-12 00:51
Where you can't expect any `handler.go` activity to speak of

tuand
2016-05-12 00:53
good point, i'll have to handle that when i initialize the list ( in `newPeerWithEngine` right now)

jyellick
2016-05-12 00:56
So, in order to keep things general, I think that the `Consenter` interface should maybe have `PeerJoined` and `PeerLeft` calls.

jyellick
2016-05-12 00:57
It's making the consensus interface slightly more complex, but I think it's easy to ignore them if you don't care

jyellick
2016-05-12 00:57
(Just like the state transfer callbacks that were added)

tuand
2016-05-12 00:58
ya ... along the same lines, i was thinking the Consenter interface would have a `Ready()` that the engine could wait on before doing `RecvMsg()`

jyellick
2016-05-12 00:58
Well, so I think I wouldn't make the engine wait

jyellick
2016-05-12 00:58
I would actually push the logic into the consenter

jyellick
2016-05-12 00:59
Simply have the plugin thread not read from the other end of the message channel until it is satisfied that it's connected to enough peers

jyellick
2016-05-12 00:59
I think @simon agreed on the phone that the right place for this is definitely inside PBFT, the fact that it's been pushed towards the helper/handler is unfortunate, and if we have an opportunity to push it back, we should.

tuand
2016-05-12 01:02
not exactly how we summarized it on the issue but i like that better. I can wait in the `obcXXXX.RecvMsg()` method

jyellick
2016-05-12 04:54
The way I would envision it, prior to entering the primary event loop, the thread would wait for the 'whitelist available' event (as it would only occur once), then enter the normal event loop, which would handle the events from `RecvMsg`

lhy555
2016-05-12 06:19
has joined #fabric-consensus-dev

jyellick
2016-05-12 13:31
@simon: I'm trying to understand this complaints stuff, trying to fix that failing test referenced in #

jyellick
2016-05-12 13:32
In obcbatch, if we receive a complaint, first we check if we are the primary, and if so, go do some stuff and return

jyellick
2016-05-12 13:33
If we are not the primary, we dedup, then `submitToLeader`, then for some reason check again if we are the primary (which I don't see how we can be?), then if not, we unicast the request to the primary

jyellick
2016-05-12 13:37
Then, eventually if a complaint timer expires, we'll get an event, and assuming the view is active, we'll send the view change request.

tuand
2016-05-12 14:07
I made an interim commit https://github.com/tuand27613/fabric/tree/whitelistTest ... let's see how the discussion goes

tuand
2016-05-12 14:07
now to see what all this make stuff is about

simon
2016-05-12 14:32
jyellick: oh i have a patch

simon
2016-05-12 14:32
jyellick: the problem is that i can't run my behave tests

jyellick
2016-05-12 14:32
Oh, so do I...

simon
2016-05-12 14:32
good

simon
2016-05-12 14:32
i just fixed the test

simon
2016-05-12 14:33
what do you have?

simon
2016-05-12 14:33
sorry, i keep being distracted - gf broke her foot and needs help around the house

jyellick
2016-05-12 14:34
Ouch, definitely take care of that, far more important

jyellick
2016-05-12 14:34
There are two pieces, was going to run past you for review, one is, I think the timer in the custodian is not being handled properly, because I was seeing multiple view changes sent back to back (same millisecond)

jyellick
2016-05-12 14:35
The other piece was that the PBFT view change was being invoked on the batch thread, rather than being injected onto the PBFT one

simon
2016-05-12 14:36
oh wow

simon
2016-05-12 14:36
so many bugs :confused:

simon
2016-05-12 14:36
i'm sorry

simon
2016-05-12 14:36
all this parallelism

simon
2016-05-12 14:37
i thought i had tracked it down to replica 0 pbft expecting a request which never gets executed because the two requests are being reordered somehow

jyellick
2016-05-12 14:39
Maybe you can help me understand this flow. When a request(s) that is in custody expires, it makes its callback into the consumer, which sends a view change

jyellick
2016-05-12 14:39
What happens to the requests which remain in custody? It seems like they should all be submitted to the new primary, and their deadlines reset?

jyellick
2016-05-12 14:42
I see that `Restart` is called out of `viewChange`, but it seems like we need to disable the custody timers after we request the view change, before the view actually changes

simon
2016-05-12 14:55
yea that is sort of an issue

simon
2016-05-12 14:55
but does it really matter?

simon
2016-05-12 14:56
ah now with two threads, we could actually expire the custody after pbft did the view change, but before we received the view change event

simon
2016-05-12 14:57
this all needs to run in one thread

jyellick
2016-05-12 15:06
It would certainly make things easier.

simon
2016-05-12 15:26
frustrating races

jyellick
2016-05-12 15:46
@simon: Another bug, `viewChange()` in batch modifies batch internal state, then tries to submit stuff to PBFT if the leader, which is on the PBFT thread, which can cause a deadlock, moving this to be an event

maro
2016-05-12 16:15
has joined #fabric-consensus-dev

simon
2016-05-12 16:50
maybe we need more real mock tests?

cbf
2016-05-12 16:50
real mock … hmmm

simon
2016-05-12 16:50
where we test things really separate from each other

simon
2016-05-12 16:50
cbf: you know, right now we do sort of integration tests within consensus

simon
2016-05-12 16:51
we create whole virtual networks, etc.

simon
2016-05-12 16:51
instead of just testing for one specific microscopic behavior

cbf
2016-05-12 16:51
yes, but you could write more unit tests that mock the setup and just test the function

simon
2016-05-12 16:51
yes that's what i mean

cbf
2016-05-12 16:51
exactly what we need because the setup etc is the part that takes time

simon
2016-05-12 16:51
oh i'm talking about the unit tests

cbf
2016-05-12 16:51
I was just making fun of the juxtaposition of real and mock

cbf
2016-05-12 16:52
our unit tests are really integration tests

simon
2016-05-12 16:52
yes

cbf
2016-05-12 16:52
I have been trying to make this case for a while

simon
2016-05-12 16:52
go is more difficult to mock than other languages

simon
2016-05-12 16:53
at least it feels that way

cbf
2016-05-12 16:53
Ginkgo can handle it because it is go

cbf
2016-05-12 16:53
I’ve been discussing this on and off with Jeff

cbf
2016-05-12 16:54
certainly it is more difficult when your test infra is python and the target is go

simon
2016-05-12 16:56
we need something that can generate mocks


simon
2016-05-12 17:07
yea i had a quick look, didn't find anything about mocks

simon
2016-05-12 17:08
`Ginkgo does not provide a mocking/stubbing framework. It’s the author’s opinion that mocks and stubs can be avoided completely by embracing dependency injection and always injecting Go interfaces. `

simon
2016-05-12 17:08
yea, i guess we should have that?

simon
2016-05-12 17:12
conveniently it completely ignores "what if i want to check that function X calls function Y of interface Z"

cbf
2016-05-12 17:14
I can hook you up with the author to discuss… alternately I have a few on my team that are very familiar

jyellick
2016-05-12 18:35
@simon: Why is `currentExec` a pointer? Is it purely to give us an 'unset' value for testing?

kostas
2016-05-12 19:17
@tuand: https://hyperledgerproject.slack.com/archives/fabric-consensus-dev/p1463013613001539 -- can you expand here? everything that you describe in these messages was taken care of in the branch that I had delivered IIRC

kostas
2016-05-12 19:18
(I read @simon's comments BTW on that thread and I agree with all of them)

tuand
2016-05-12 19:20
so i had to make a few changes ... write to db instead of to a file ... use a channel to wait until we have the whitelist ... also, we added an engine object to handle talking directly to the local peer instead of grpc

tuand
2016-05-12 19:21
i was trying to see where obcXXXX should wait for the whitelist to complete

tuand
2016-05-12 19:22
but in the end, waiting on isSufficientlyConnected in RecvMsg seems still the right place

jyellick
2016-05-12 20:07
It might work, but my vote would be to put the channel read in front of the `for` loop in `main()`

jyellick
2016-05-12 20:08
The whitelisting isn't a message, so, I don't think that's the right place to block for it

garisingh
2016-05-12 20:12
@jyellick - quick question - in PBFT, if a peer goes down and later comes back up and rejoins the network, my assumption would be that it will find that it is out of sync and request a state transfer, or, in the absence of new transactions, this will be discovered during checkpointing. Correct?

jyellick
2016-05-12 20:14
@garisingh: In the absence of transactions, no recovery will happen, recovery is driven by eavesdropping, so there needs to be requests on the network


garisingh
2016-05-12 20:16
okay - so peer goes down, misses N transactions, comes back up and transaction N+1 is invoked.

jyellick
2016-05-12 20:18
The specific value of N matters a little here, but in general, no, unless there is some other sort of problem which prevents transaction n+1 from being committed to 2f+1 peers, the freshly restarted peer will not execute it

garisingh
2016-05-12 20:20
okay - well on a related note, let's say that state transfer does happen. State transfer is more like a "bulk" event rather than transaction processing, correct? Meaning both the K/V store and the "ledger" (blocks) are force updated? Meaning potentially a different path through the code than normal transactions?

jyellick
2016-05-12 20:21
That is correct, the normal path is basically `BeginBatchTX/ExecTX/CommitTxBatch`, and this modifies the state, and commits a block to the chain.

jyellick
2016-05-12 20:21
The state transfer path retrieves blocks from the network, validates them against the valid hash from the consensus checkpointing, and then commits them to the ledger through `PutRawBlock`

jyellick
2016-05-12 20:22
Once the blockchain is intact, the state is played forward to current utilizing the state deltas stored in the DB. At each step, the state hash is verified against the block hash in the chain.

jyellick
2016-05-12 20:23
(Under some conditions, such as when the replica is very far behind, it will actually retrieve a full copy of the state, and a partial copy of the blockchain, then restore the remainder of the chain in the background)

garisingh
2016-05-12 20:23
gotcha - so basically like bulk database transfer with some checkpointing (i.e. validating the state hash)

jyellick
2016-05-12 20:24
Right, a similar idea. There are some problems with this at the moment, since we store some things like chaincodes on the block which should probably be stored in the state, so it is possible that the partial copy of the blockchain is not good enough.

jyellick
2016-05-12 20:25
@muralisr: might have some insight as to if/when the chaincodes will move off the block and into the state

garisingh
2016-05-12 20:26
many thanks

jyellick
2016-05-12 20:27
No problem, always happy to help

simon
2016-05-13 10:14
jyellick: correct, it's a pointer so that it can be nil

xinxi
2016-05-13 10:25
has joined #fabric-consensus-dev

xinxi
2016-05-13 10:27
Hi guys, I am recently studying the consensus protocol of HyperLedger. If my understanding is correct, PBFT's leaders could suffer from DDoS attacks as they are elected by all nodes so their IP addresses are public.

xinxi
2016-05-13 10:27
I am wondering how HyperLedger solves this problem?

simon
2016-05-13 10:30
the nodes only have to be reachable by each other

simon
2016-05-13 10:31
i.e. if you have a 10 node network, you can tolerate 3 byzantine failures. only those 10 nodes need to be able to talk to each other

simon
2016-05-13 10:31
xinxi: does that answer your question?

xinxi
2016-05-13 10:34
Yeah, that makes sense. But in an open network environment, all the IP addresses including the one of the leader are also exposed to the external environment. Can people schedule a DDoS attack against the leader?

simon
2016-05-13 10:34
of course they can

simon
2016-05-13 10:35
but you could just filter this, or connect the nodes via a VPN

xinxi
2016-05-13 10:37
A firewall can surely mitigate the problem to some extent. Actually, Bitcoin miners can also be DDoSed. But the advantage of Bitcoin miners is that among about 8,000 nodes, people don't know which nodes belong to the miners.

xinxi
2016-05-13 10:38
However, in PBFT, the IP of the leader is known to everyone, which makes DDoS very easy.

xinxi
2016-05-13 10:39
Does this make sense to you?

simon
2016-05-13 10:40
the pbft nodes can be completely shielded, and run on a private IP range

simon
2016-05-13 10:41
and only receive messages from trusted clients (which act as proxies)

xinxi
2016-05-13 10:41
Then it is running in a protected environment which is not open to the public.

xinxi
2016-05-13 10:42
So if this is the case, it is good for enterprises to use, not for the some applications like creating a new cryptocurrency?

simon
2016-05-13 10:44
you would use pbft for a permissioned network, i.e. participants need to register, etc.

simon
2016-05-13 10:45
i don't think this would be a design for a cryptocurrency

xinxi
2016-05-13 10:46
OK. Thank you for your clear answer.

simon
2016-05-13 10:46
i hope it was clear :simple_smile:

xinxi
2016-05-13 10:46
However, I’ve heard there is some kind of leaderless Byzantine Fault Tolerance protocol.

simon
2016-05-13 10:47
there are many different protocols

xinxi
2016-05-13 10:47
Will that make it better?

simon
2016-05-13 10:47
i guess it could address some of your concerns

simon
2016-05-13 10:48
but usually you would use a small number of nodes (typically 4-20, most likely never >1000)

simon
2016-05-13 10:48
and in that case DoS is easy to mount on all the nodes

xinxi
2016-05-13 10:49
I see. Now I see the real purpose of hyperledger is an IT infrastructure for enterprises.

xinxi
2016-05-13 10:49
That’s a pretty good aim.

isidoro.ghezzi
2016-05-13 11:06
has joined #fabric-consensus-dev

ghaskins
2016-05-13 14:15
@simon @jyellick are these CI failures consensus related? https://travis-ci.org/hyperledger/fabric/builds/129982579

tuand
2016-05-13 14:21
i started to read the log and travis restarted ?

jyellick
2016-05-13 14:23
@ghaskins: I cannot see the logs, though the `TestBatchCustody` failure is definitely due to consensus

simon
2016-05-13 14:23
there is a PR to fix this

simon
2016-05-13 14:23
(from me)

ghaskins
2016-05-13 16:11
@simon: which #?

ghaskins
2016-05-13 16:13
@tuand: apologies, i restarted the build

ghaskins
2016-05-13 16:15
@simon: nevermind, i found it

ghaskins
2016-05-13 16:15
it wasnt showing up because it was already merged

simon
2016-05-13 16:15
:simple_smile:

cbf
2016-05-15 19:30
@simon: a number of mock tools for golang here https://github.com/avelino/awesome-go#testing

popldo
2016-05-17 02:50
has joined #fabric-consensus-dev

simon
2016-05-17 12:39
i just realized that we need to clear our custody store on state transfer

simon
2016-05-17 12:47
lots of stale complaints

simon
2016-05-17 12:48
i need to fix this

simon
2016-05-17 13:11
i had a patch for these stale complaints - rebasing it for master

simon
2016-05-17 13:12
also i think we should merge pbft-core and batch

simon
2016-05-17 13:12
so that we get rid of the race conditions

jyellick
2016-05-17 14:02
Yes, I think we really need to kill the pbft plugin concept. Sieve is de-emphasized, classic is essentially batch=1. By eliminating the consumer plugin model, it would be easy to have all of PBFT run on a single thread, and eliminate all the nastiness that arises trying to have a batch thread and a PBFT thread.

plucena
2016-05-17 14:49
has joined #fabric-consensus-dev

simon
2016-05-17 16:46
jyellick: i could work on that

simon
2016-05-17 16:46
unless we have something more important to tackle

jyellick
2016-05-17 16:47
It needs to be done at some point, I thought the idea was to wait until after this sprint, but I don't see any harm in starting it sooner

jyellick
2016-05-17 16:47
I think the key will be to do it in small patches

simon
2016-05-17 16:50
what's up for this sprint?

jyellick
2016-05-17 16:50
Well, I think it's supposed to be a stabilizing sprint

simon
2016-05-17 16:50
yea

simon
2016-05-17 16:50
well, getting rid of race conditions is important

tuand
2016-05-17 16:51
Right . clearing out any bug issues

simon
2016-05-17 16:51
another thing is catching up replicas

simon
2016-05-17 16:51
yea i need to drop stale complaints

jyellick
2016-05-17 16:51
Actually, if you wanted, I was planning to work on https://github.com/hyperledger/fabric/issues/1454, but you could and I could focus on some state transfer stuff

simon
2016-05-17 16:51
yes, let's do that

jyellick
2016-05-17 16:52
Sounds good. Start with the periodic null requests?

jyellick
2016-05-17 16:53
What do you think?
PR1: add periodic null requests
PR2: have backups detect lack of periodic requests, to initiate view change
PR3: eavesdrop on view

tuand
2016-05-17 16:53
Simon could you look at #1466?

simon
2016-05-17 20:48
okay

simon
2016-05-17 20:48
sorry, back

rupendradhillon
2016-05-18 04:54
has joined #fabric-consensus-dev

allanpark
2016-05-18 11:31
has joined #fabric-consensus-dev

simon
2016-05-18 16:12
jyellick: are you around?

christophera
2016-05-19 00:01
@christophera has left the channel

mtakemiya
2016-05-19 08:29
has joined #fabric-consensus-dev

simon
2016-05-19 12:24
jyellick: you around?

jyellick
2016-05-19 13:19
@simon I am now

simon
2016-05-19 13:19
hi

simon
2016-05-19 13:20
i forgot what i wanted to ask

simon
2016-05-19 13:20
something about testing

simon
2016-05-19 13:20
i'm trying to write a test without creating a whole network

jyellick
2016-05-19 13:21
Ah, hmm, could be done, but also seems likely to be very verbose

simon
2016-05-19 13:22
but way less finicky

simon
2016-05-19 13:23
if we could get rid of that goroutine

simon
2016-05-19 13:23
and have a state machine...

jyellick
2016-05-19 13:25
I find so long as `processContinually` is avoided, the current network tests are pretty reliable, though I agree about moving to the state machine

simon
2016-05-19 13:27
oh i think `idleChannel()` doesn't work as expected

simon
2016-05-19 13:28
i think to do this properly, the dispatch would have to only write to idleChan if the select would block otherwise

simon
2016-05-19 13:28
(and no timers are running)

jyellick
2016-05-19 13:31
So `idleChannel` is probably misleading; the promise is that if the same thread writes to a pbft channel, then blocks on `idleChan`, you know the pbft thread has finished processing the event you first delivered. It's entirely possible there are other events pending.

jyellick
2016-05-19 13:33
In the case of the mock network code, because it is the thread which delivers the message to the pbft thread, it knows that when `idleChan` unblocks, that message has been processed.

jyellick
2016-05-19 13:34
(so, not a true measure of idleness, but I'm not sure what a clearer name would be, and I also couldn't find a cleaner way to detect the 'message has been processed' state)

simon
2016-05-19 13:34
no you are not guaranteed that

simon
2016-05-19 13:35
the main routine may as well serve the idlechannel first

simon
2016-05-19 13:35
select{} order is not priority

jyellick
2016-05-19 13:35
They're all unbuffered channels

jyellick
2016-05-19 13:35
If you write to one, then read from another, you know that the thread leaves the select in between

jyellick
2016-05-19 13:36
(because the write blocks until the select 'chooses it')

simon
2016-05-19 13:36
no

simon
2016-05-19 13:36
oh!

simon
2016-05-19 13:37
right

simon
2016-05-19 13:37
now i understand

jyellick
2016-05-19 13:37
:simple_smile:

simon
2016-05-19 13:39
bah, this coupling between batch and pbft core makes testing tricky

simon
2016-05-19 13:39
ideally i should mock the pbft core

simon
2016-05-19 13:50
wow you made the custodian very verbose

simon
2016-05-19 13:50
was there a bug in there?

jyellick
2016-05-19 13:53
There was

jyellick
2016-05-19 13:54
The verbosity was more for my understanding, we could probably back it down

jyellick
2016-05-19 13:55
And calling it a bug may be disingenuous, but the behavior was not what the other side seemed to be expecting

simon
2016-05-19 13:56
oh

jyellick
2016-05-19 13:56
The custodian removed things as their timers expired, then the other side expected them to be there on `Restart` and they were not there, so, modified the custodian to re-register on expiration until manually removed

simon
2016-05-19 13:56
oh

simon
2016-05-19 13:56
then we should change the docs and name

simon
2016-05-19 13:57
to reflect either way

jyellick
2016-05-19 13:57
Really, that custodian thread needs to go away too, I think there can be races

simon
2016-05-19 13:57
yes it needs to

jyellick
2016-05-19 13:57
I thought I updated the docs, sorry if I missed some

simon
2016-05-19 13:57
if all is an assembly of state machines, it will be much better

simon
2016-05-19 13:58
and timeouts should be handled by the state machine engine

jyellick
2016-05-19 13:58
Exactly

simon
2016-05-19 14:16
hmmm

simon
2016-05-19 14:17
i think there is some bug with the complainer re-adding requests into custody

simon
2016-05-19 14:19
jyellick: i didn't think the custodian should keep requests around

simon
2016-05-19 14:19
why do you prefer this behavior?

simon
2016-05-19 14:21
i think you've traded one bug for another

simon
2016-05-19 14:25
now there is a race condition where briefly the custodian claims that a request is not under custody, but then is again in custody

simon
2016-05-19 14:26
which can race with another thread removing the request

simon
2016-05-19 14:26
i think i need to revert that

simon
2016-05-19 14:27
but for that i need to understand what bug you addressed with that

jyellick
2016-05-19 14:38
So, I was considering, that yes, the request probably needs to be added while the lock is retained

jyellick
2016-05-19 14:40
And, the problem is, that the callback from the custodian sends a view change, but does nothing to retain the request which expired

jyellick
2016-05-19 14:41
We could revert to the old behavior, but, then `Complain` would actually need to store the complaint that fired, and process it after the view change

jyellick
2016-05-19 14:42
So, maybe to be more clear:
vp1 takes req1 into custody
req1 custody timer expires, is removed from custodian and calls `Complain`
vp1 gets complaint, initiates a view change
vp1 is new primary, goes to process everything in its complaint store, which is now empty
req1 no longer has any references

jyellick
2016-05-19 14:44
But yes, I agree, there is a race: the `Register` needs to take place with the mutex held. Though as a real-world race, I think this one is extremely unlikely; it's extremely likely that that thread will have woken up by the time a view change is completely processed and a new view is accepted

simon
2016-05-19 14:45
yes, complaints are not retained, that is on purpose

simon
2016-05-19 14:45
the original custody holder can resubmit them on view change

simon
2016-05-19 14:45
the custodian keeps custody requests until they are successful

simon
2016-05-19 14:45
it doesn't keep complaints

jyellick
2016-05-19 14:47
I see, hmmm, maybe the re-register should go away then, I see that it's being done in `custodyTimeout`

jyellick
2016-05-19 14:49
I'm trying to remember if there were any other fixes that went into `custodian.go` or not

simon
2016-05-19 14:49
wait, how does the notifyRoutine get called again?

simon
2016-05-19 14:49
```
if !obj.canceled {
	expired = &CustodyPair{obj.id, obj.data}
} else {
	c.resetTimer()
}
```

simon
2016-05-19 14:49
woops

jyellick
2016-05-19 14:51
`notifyRoutine` is called from a go routine from `resetTimer`

simon
2016-05-19 14:52
yes

simon
2016-05-19 14:52
but then notifyRoutine will have to call resetTimer again

simon
2016-05-19 14:53
not only if the request wasn't cancelled

simon
2016-05-19 14:53
no, vice versa

simon
2016-05-19 14:53
not only if the request was cancelled

jyellick
2016-05-19 14:54
Yes, there's a bug there

jyellick
2016-05-19 14:54
That should probably be converted back to a for loop

jyellick
2016-05-19 14:56
```
for {
	select {
	case <-c.timer.C:
		break
	case <-c.stopCh:
		c.stopCh = nil
		return
	}
	c.lock.Lock()
	var expired []CustodyPair
	for _, obj := range c.seq {
		if obj.deadline.After(time.Now()) {
			break
		}
		if !obj.canceled {
			expired = append(expired, CustodyPair{obj.id, obj.data})
		}
		delete(c.requests, obj.id)
		c.seq = c.seq[1:]
	}
	c.resetTimer()
	c.lock.Unlock()
	for _, data := range expired {
		c.notifyCb(data.ID, data.Data)
	}
}
```
That is what the original implementation looked like

simon
2016-05-19 14:59
yes

jyellick
2016-05-19 15:00
I think it might be worth reverting the changes to the `notifyRoutine` and the tests which rely on the re-register behavior

simon
2016-05-19 15:00
ah, scrum

jyellick
2016-05-19 15:01
The original implementation looks correct to me now,

tuand
2016-05-19 15:01
:smile:

simon
2016-05-19 15:01
:smile:

jyellick
2016-05-19 15:28
@simon: I'm putting together a patch to restore the bulk of your original custodian stuff, it breaks the BatchCustody test, will track it down though

jyellick
2016-05-19 15:38
Okay, so i think this is what's going on, and why my changeset 'fixed' it. The problem is that whenever the length of `expired` is more than 1, then we get back to back complaint callbacks, which basically triggers back to back view change messages, and calling `Restart` does nothing to prevent this, because the `expired` slice is already built.

jyellick
2016-05-19 15:39
So, along with the bugs that my changes introduced, it only ever processes one expired at a time, spawning a new go routine to handle the next, which seems to give the rest of the code time to call `Restart` and get things cleaned up.

jyellick
2016-05-19 15:46
And I guess what I was trying to do, with the go routine that doesn't loop, was to stop more callbacks until someone called into `Register` again (of course, I later introduced a race by calling back into `Register` from the `notifyRoutine`)

jyellick
2016-05-19 16:04
@simon: Any thoughts on how to handle this? Also, custody expirations only broadcast a complaint, it seems like it should also register the complaint with itself, so that it can contribute a view change message, is there a reason why it does not?

simon
2016-05-19 16:29
aaaah

simon
2016-05-19 16:30
yes, all good points

jyellick
2016-05-19 16:30
@simon: I'm also confused as to where the 'resubmitting requests' you referred to exists. I did modify the `Restart` routine to return both the requests in custody as well as the complaints, so that the new primary would process the outstanding complaints it knew about. But, I don't see anywhere that the backup resubmits its requests on view change.

jyellick
2016-05-19 16:31
The `outstandingReqs` in PBFT needed to be zero-ed out on view change, because it is looking for exact batch messages, and really only gets populated on the primary

simon
2016-05-19 16:31
so many bugs in so little code

simon
2016-05-19 16:31
yea

jyellick
2016-05-19 16:36
So it doesn't solve the back to back view change things, but I'm thinking these two changes are correct:
1. When a backup complains, it should register the complaint with itself, so that when it expires, it sends a view change
2. When the backup calls `Restart`, it should loop through all the requests in its custody, and for those which it hasn't complained about (because the new primary should already have that complaint) it should resubmit those to the primary.

jyellick
2016-05-19 16:40
As to fixing the back to back view change problems, what would you think of adding some additional metadata to the request, basically, what view it was taken into custody/complaint. On `Restart`, you could supply the view that things are restarting in. And, then through the callback, you could filter out expirations which are not for your current view. Thoughts @simon?

simon
2016-05-19 16:41
hmm

simon
2016-05-19 16:41
i'll think about it some more

simon
2016-05-19 16:41
this sounds pretty complicated

jyellick
2016-05-19 16:43
Okay, I'll head back off into state transfer land then, let me know if you want to talk more about it

simon
2016-05-19 16:45
sure

simon
2016-05-19 16:45
thanks

simon
2016-05-20 13:33
jyellick: we can't access op.pbft.activeView, because that's racy

simon
2016-05-20 13:33
we really need to put this in a single thread

jyellick
2016-05-20 13:58
@simon: 100% agree, that's been on my radar for a bit, thought about opening an issue for it, but we reference it a lot

simon
2016-05-20 13:58
yea i just opened one

jyellick
2016-05-20 13:58
We could try to track it in both locations, but at the end of the day, we just need to get rid of the second thread

simon
2016-05-20 15:06
jyellick: TestClassicBackToBackStateTransfer is failing because the state transfer seems to happen at seqno 4

jyellick
2016-05-20 15:07
This is presumably after some changes you made? Is this after the null stuff?

jyellick
2016-05-20 15:07
(as I've not seen that test fail locally or in CI)

simon
2016-05-20 15:34
my only explanation is that it did a state transfer more quickly

simon
2016-05-20 15:35
okay, i can't run any of these unit tests


jyellick
2016-05-20 15:38
Is this what's in master, or are there changes on top of it?

simon
2016-05-20 15:39
yes, my changes

jyellick
2016-05-20 15:39
Are they the null eavesdropping changes?

simon
2016-05-20 15:39
nono

simon
2016-05-20 15:39
i'm still ironing out the custody bugs

simon
2016-05-20 15:39
or rather, i think i've been mostly trying to pass erratic unit tests for the last few hours

simon
2016-05-20 15:40
i'm going to submit this PR as is

simon
2016-05-20 15:40
the failures are random here

jyellick
2016-05-20 15:41
I'm not sure why custody should have any effect on it. I think if these unit tests were failing randomly, we would have seen CI complain

simon
2016-05-20 15:41
well, my laptop is slower than everybody's machines

simon
2016-05-20 15:42
it exercises different code interleavings

simon
2016-05-20 15:42
which is not too bad

simon
2016-05-20 15:42
because it exposes bugs occasionally

simon
2016-05-20 15:42
and badly written tests (all :)

simon
2016-05-20 15:42
let's see what CI says


simon
2016-05-20 15:43
i'll go outside for a walk, haven't been in fresh air for a week

jyellick
2016-05-20 15:47
Looking at that test, wondering if I don't see a bug

jyellick
2016-05-20 15:51
Actually, not a bug, but, try on line 77 of mock_consumer_test.go bumping `MaxStateTransferTime` up to say, 400 instead of 200

jyellick
2016-05-20 15:52
The test does assume that 100ms is sufficient time to process 5 rounds of PBFT

simon
2016-05-20 15:53
uh oh

simon
2016-05-20 15:53
talk about brittle code :slightly_smiling_face:

jyellick
2016-05-20 15:54
Well, it takes 20ms on my laptop to do that

jyellick
2016-05-20 15:56
So, 5x seemed like a reasonable guess as to the power of other machines. Actually blocking for it would be better, but, would have made the changes for the test much more invasive.

jyellick
2016-05-20 15:58
Wow, it is taking over 3 seconds on your machine

jyellick
2016-05-20 15:58
That's roughly 150x as slow

jyellick
2016-05-20 16:01
I can take a TODO for this to modify the test infrastructure to support blocking this call, but I'm a little shocked any of the timing related tests run successfully for you

vukolic
2016-05-20 17:34
@tuand: @kostas @simon @jyellick Re Sharon's email

vukolic
2016-05-20 17:34
and #1454 and #1120

vukolic
2016-05-20 17:34
if they pause ONE node and want it to resume - that node falls directly into 1454

vukolic
2016-05-20 17:35
as the view does not change - so this is the most severe case and it won't be addressed by 1120

vukolic
2016-05-20 17:35
there is no way to make this node come back - as we cannot change the view based on the decision of one node, nor can we resume the node in a view

vukolic
2016-05-20 17:35
so we need SUSPECTs

vukolic
2016-05-20 17:35
my question is

vukolic
2016-05-20 17:35
can you implement configurable SUSPECTs?

vukolic
2016-05-20 17:36
meaning - there is a config param in which the whole trick can be switched off?

vukolic
2016-05-20 17:37
the reason is I would like to avoid adding instability to the code - so, if we decide to implement, I would like to sandbox the SUSPECTs as much as possible, so they do not add more instability

jyellick
2016-05-20 17:43
There's nothing technically infeasible about making it configurable, though obviously that is more work than simply forcing it to be enabled

jyellick
2016-05-20 17:44
@vukolic: I responded to your email, but, per the discussions in those issues, I thought it had been agreed that 1454 was the priority, and that 1120 should be deferred

jyellick
2016-05-20 17:47
What might be a simpler solution, which I'm not sure I like, but would like your thoughts on. What if we mandated view changes every n checkpoints? Combined with periodic null requests, this would at least guarantee us 'eventual consistency', and without introducing new messages.

jyellick
2016-05-20 17:48
Maybe that's a horrible idea, but it seems like it could be implemented much more easily than the SUSPECT mechanism. I've also heard that despite PBFT's design, some users might not be comfortable with the idea that the primary never changes, so long as no one believes it to be byzantine, and this would address that fear as well.

vukolic
2016-05-20 17:49
it is not necessarily a bad idea - some protocols do this anyway (see Aardvark from UT Austin)

simon
2016-05-20 17:49
hi

vukolic
2016-05-20 17:49
the drawback is killing the performance when we have a good leader

vukolic
2016-05-20 17:49
but I agree it may be easier to implement

vukolic
2016-05-20 17:51
ok - do try it

vukolic
2016-05-20 17:51
just make this configurable

vukolic
2016-05-20 17:51
with a special value that switches it off

vukolic
2016-05-20 17:51
e.g., we can configure after how many reqs we switch the leader

vukolic
2016-05-20 17:52
but if we put -1 then we switch it off (or whatever value)

jyellick
2016-05-20 17:52
Right, I think that would be a much more direct change, and much easier to enable/disable than something like SUSPECT

vukolic
2016-05-20 17:52
agree

vukolic
2016-05-20 17:52
ok solved

vukolic
2016-05-20 17:52
:slightly_smiling_face:

jyellick
2016-05-20 17:53
Haha, great, I'll throw a comment on 1120

vukolic
2016-05-20 17:53
and 1454

simon
2016-05-20 17:53
i'm still stuck trying to remove small bugs from the code

simon
2016-05-20 17:53
rather than adding features

vukolic
2016-05-20 17:54
by the way - let's make sure the test case keeps adding requests

vukolic
2016-05-20 17:54
otherwise there is no use of this :slightly_smiling_face:

tuand
2016-05-20 17:55
already talked to the testers

simon
2016-05-20 17:55
i think before we add any convenience feature (replicas catching up), we need to address any subtle bugs

vukolic
2016-05-20 17:55
@simon anything more specific?


vukolic
2016-05-20 17:56
I thought we are getting rid of pbft-classic and merging core and batch?

jyellick
2016-05-20 17:57
I think we should. The problem is, we either need to duplicate the code, and do this, or we need to remove Sieve.

jyellick
2016-05-20 17:58
Merging classic and batch effectively requires eliminating the pbft plugin concept in the code.

vukolic
2016-05-20 17:58
Sieve should be implementable on the merged pbft

vukolic
2016-05-20 17:58
probably with quite some refactoring

jyellick
2016-05-20 17:58
Not the way the code is structured today. Sieve also has a number of outstanding bugs with no clear solutions.

tuand
2016-05-20 17:59
there are a few more issues floating e.g. #1538 #1466

jyellick
2016-05-20 17:59
I would agree it could be implemented on top of the merged PBFT, but it would certainly be broken in the meantime.

vukolic
2016-05-20 17:59
that's clear

simon
2016-05-20 18:00
jyellick: https://github.com/hyperledger/fabric/issues/1538 that's state transfer being confused?

jyellick
2016-05-20 18:00
(And 'breaking it in the meantime', to me is effectively removing it)

simon
2016-05-20 18:00
so:

jyellick
2016-05-20 18:01
No, I think that is the longstanding Sieve bug that Sieve must advertise the block hash before the block is committed, so when state transfer goes to retrieve it, it is a race to see whether the block is written or read first.

vukolic
2016-05-20 18:01
seems bishop says #1538 is looking like #1120?

vukolic
2016-05-20 18:01
from a skim I'd say it at least looks similar

simon
2016-05-20 18:01
if we make batch agree on (and sign in COMMIT) the next block (hash), instead of on a set of transactions, we already have half the infrastructure to implement sieve in batch

jyellick
2016-05-20 18:01
And any Sieve and state transfer that is not a single block catchup, I think is broken right now

jyellick
2016-05-20 18:02
Or maybe not any

simon
2016-05-20 18:02
jyellick: but statetransfer complains about non-matching correlation IDs

jyellick
2016-05-20 18:02
But any time it tries to recover the sequence number from the block

vukolic
2016-05-20 18:02
I have trouble following multithreaded slack conversations...

simon
2016-05-20 18:02
:slightly_smiling_face:

simon
2016-05-20 18:02
sorry

vukolic
2016-05-20 18:03
no parallel processing here...

simon
2016-05-20 18:03
hehe

simon
2016-05-20 18:03
decades of IRC trained me

vukolic
2016-05-20 18:03
so anyway

vukolic
2016-05-20 18:03
1535 is orthogonal to 1120

simon
2016-05-20 18:03
yes they are

vukolic
2016-05-20 18:03
so Jason pls try what you just suggested and make it -1able

vukolic
2016-05-20 18:04
@simon as for sigs on commits - we could have the same

vukolic
2016-05-20 18:04
this was 1182, right?

simon
2016-05-20 18:06
i forget the number

simon
2016-05-20 18:07
we have a problem with sieve persistence and maybe also state transfer

simon
2016-05-20 18:07
but with signed commits, maybe we can treat them as checkpoints

simon
2016-05-20 18:07
then we can get rid of checkpoints, and use the signed commits to catch up in a granular fashion

simon
2016-05-20 18:09
jyellick: TestBatchStaleCustody failed in CI...

simon
2016-05-20 18:09
all stupid racy tests

jyellick
2016-05-20 18:09
Yep, there's that race there, which we ID-ed yesterday

simon
2016-05-20 18:09
in the test, not in the code?

jyellick
2016-05-20 18:09
Nope, in the code

jyellick
2016-05-20 18:10
@vukolic: @simon before you sign off, I'd love to talk with you about some of what's been discussed in RTP regarding 'consenting on output', and signing blocks, etc.

simon
2016-05-20 18:10
yes

simon
2016-05-20 18:11
should we do a quick call, or do you prefer here?

simon
2016-05-20 18:11
here documents it for eternity

jyellick
2016-05-20 18:11
Either works for me, whichever you guys are more comfortable with

simon
2016-05-20 18:11
here is fine

jyellick
2016-05-20 18:13
Okay, so, the key issue is, people do not like the fact that we only 'consent on the input', and that even with 100% deterministic transactions (ie, the postimage stuff), people are still not happy about that. Further, they do not like the fact that the only time we get a guarantee that the network is in a particular state is at checkpoints, which means there are up to k-1 blocks whose content, they would argue, they cannot trust.

jyellick
2016-05-20 18:15
Obviously we could do a round of signing after every block, and store some signatures on the block, but, at that point, we've really lost a lot of PBFT, because if we're going to be doing signatures at every round, why did we choose a protocol that deliberately avoids them? I'm obviously not a protocol expert, but I believe that with the use of signatures, it is possible to perform less chatty byzantine consensus.

jyellick
2016-05-20 18:16
They further hate the idea that although the network eventually halts on non-determinism, there are committed blocks which should not have been committed in this case.

jyellick
2016-05-20 18:16
So, the proposal I would give is the following. Today, a COMMIT message corresponds to a block, and we issue checkpoints which confirm we all have the same view of the blockchain.

jyellick
2016-05-20 18:17
I would propose, that we make COMMIT messages correspond to executions, and CHECKPOINT messages correspond to blocks.

jyellick
2016-05-20 18:17
Then, we can sign checkpoint messages, and since this is at a configurable interval, we can control the overhead from the signatures.

jyellick
2016-05-20 18:18
And further, once a replica receives f+1 signatures, it can broadcast them to its NVPs, which effectively acts as a 'strong read' for that block.

jyellick
2016-05-20 18:19
We could actually bump that to 2f+1 signatures, if we wanted to ensure that the network will continue to make progress.

jyellick
2016-05-20 18:19
(as in, only consider the read strong if we know we can build more blocks upon the current state)
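
The f+1 vs 2f+1 distinction above can be sketched in Go (a toy illustration; actual signature verification is elided, and the function names are invented): f+1 matching signatures mean at least one correct replica vouches for the digest, while 2f+1 additionally show a full quorum holds this state, so the network can keep building on it.

```go
package main

import "fmt"

// countMatching counts signatures over the same checkpoint digest.
// sigs maps replica ID to the digest that replica signed (actual
// cryptographic verification is elided in this sketch).
func countMatching(sigs map[int]string, digest string) int {
	n := 0
	for _, d := range sigs {
		if d == digest {
			n++
		}
	}
	return n
}

// strongRead: f+1 matching signatures guarantee at least one correct
// replica signed this digest.
func strongRead(sigs map[int]string, digest string, f int) bool {
	return countMatching(sigs, digest) >= f+1
}

// progress: 2f+1 matching signatures show a quorum holds this state,
// so the network can continue to make progress on top of it.
func progress(sigs map[int]string, digest string, f int) bool {
	return countMatching(sigs, digest) >= 2*f+1
}

func main() {
	f := 1 // n = 3f+1 = 4 replicas
	sigs := map[int]string{0: "h1", 1: "h1", 2: "h2"}
	fmt.Println(strongRead(sigs, "h1", f)) // true: 2 >= f+1
	fmt.Println(progress(sigs, "h1", f))   // false: 2 < 2f+1
}
```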

simon
2016-05-20 18:19
how does that all relate to the new consensus architecture?

jyellick
2016-05-20 18:20
Well, in the MVCC+postimage world, there's no state to shift around, so having a copy of the blockchain is sufficient to prove the state of the world, especially if you are only interested in the 'latest' state of a key, so a strong read on a block hash, is trivially a strong read against all the previous key values.

jyellick
2016-05-20 18:22
Because we only actually call 'commit' in the ledger at the checkpoint, and only once we have the signatures, everyone gets the promise that any reads they do will always be against an unequivocally committed state (whereas before, if nondeterminism had diverged the chains, it is possible that it could be wrong).

jyellick
2016-05-20 18:23
And assuming the checkpoints have signatures, you don't even need to introduce a strong read, and you don't have clients needing to deal with trying to connect to multiple VPs.

jyellick
2016-05-20 18:24
The complaint from @kostas I believe is that you basically force higher latency on people. That if you want to perform a read, and you're 'sure' that the transactions are deterministic, you can get a fresher result if we commit the blocks with the COMMIT messages.

simon
2016-05-20 18:24
well, you still could be lagged

simon
2016-05-20 18:25
but yes

jyellick
2016-05-20 18:25
Absolutely, but what you get will never be incorrect for that version.

simon
2016-05-20 18:25
has this been an ongoing conversation? or did that come up recently?

jyellick
2016-05-20 18:26
It happened late last week on a whiteboard randomly when @kostas, @sheehan, @binhn and I happened to be in the same room

jyellick
2016-05-20 18:27
I think it was a point @sheehan made: no one actually wants a "I'm 99% sure this read is reading something that will be committed" - reporting data before we can verify it via a checkpoint is really of very little value.

simon
2016-05-20 18:27
well

simon
2016-05-20 18:28
why would it be wrong?

jyellick
2016-05-20 18:30
With deterministic transactions, it should never be wrong. But, that's apparently not an argument the community is willing to accept, and to some extent, I get it. Your RAM is hit by some cosmic radiation and you screw up the execution. I think it's a bit of a silly game to play "I can't trust myself", but something could have gone wrong, and if you get agreement from f+1, or 2f+1 other replicas in the form of signatures, you can have a much higher confidence. You've essentially got consent on the output.

simon
2016-05-20 18:30
:slightly_smiling_face:

simon
2016-05-20 18:31
your ram is hit after you calculate the hash, but before you write it to disk...

vukolic
2016-05-20 18:32
the problem in making sure you have the right hash

jyellick
2016-05-20 18:32
It all seems pretty unlikely to me... and it would only help with benign faults. If it's a malicious fault, like, maybe somebody fixes your crypto sig checker to always return true.... then you have a problem.

vukolic
2016-05-20 18:32
is that no amount of "protection" is going to save you from a "cosmic ray"

vukolic
2016-05-20 18:32
we could run agreement on output/input/NBA finals outcome

vukolic
2016-05-20 18:32
and then once we are done

vukolic
2016-05-20 18:32
with whatever

vukolic
2016-05-20 18:32
"cosmic ray" strikes

vukolic
2016-05-20 18:32
so, how can you be sure?

jyellick
2016-05-20 18:33
Yes, I agree, I think real world, analytically, once you start not trusting yourself, you've got nothing. Pretty sure philosophers have pondered on this for a long time. But, apparently, psychologically, people really want consent on the output.

simon
2016-05-20 18:33
:slightly_smiling_face:

simon
2016-05-20 18:34
arguably, the less computation happens between consenting and final data, the less bug surface you have

jyellick
2016-05-20 18:34
And, you also need a little less trust between the clients and their VP, if they can validate the other VP's signatures independently.

vukolic
2016-05-20 18:34
so we can agree on output - this is what sieve does - it may be buggy but conceptually it is implementable

vukolic
2016-05-20 18:35
with mvcc+ postimage we will have a leaderless approach to the same thing

vukolic
2016-05-20 18:35
which should inherently be less buggy

vukolic
2016-05-20 18:35
it is however IMPOSSIBLE

vukolic
2016-05-20 18:35
to have a 3 round ala PBFT protocol

vukolic
2016-05-20 18:35
that agrees on output

vukolic
2016-05-20 18:35
I can write a proof

vukolic
2016-05-20 18:35
so

jyellick
2016-05-20 18:35
Well, I think the difference between this and Sieve would be. "Assume your transactions are deterministic, and if they're not, it's fine if the network halts, but, never commit anything to the chain, unless the network consents on the output"

vukolic
2016-05-20 18:36
hacking into PBFT to agree on output, without adding more comm is doomed to fail

jyellick
2016-05-20 18:36
What is wrong with not 'committing' the block, until you have a stable checkpoint

vukolic
2016-05-20 18:36
ah but that is more msgs

vukolic
2016-05-20 18:36
like k*3

vukolic
2016-05-20 18:36
:slightly_smiling_face:

jyellick
2016-05-20 18:37
For sure, I agree it is. So, it's basically cheating. People dislike that PBFT doesn't validate at every round.... so, we say fine, we'll let PBFT run as it's designed and do what it does, and quickly. But we only officially agree on the result of PBFT at the checkpoint interval.

vukolic
2016-05-20 18:38
btw, that does not explain what happens if the checkpoint reveals a non-det tx

vukolic
2016-05-20 18:38
what do you do then?

jyellick
2016-05-20 18:38
Halt

vukolic
2016-05-20 18:38
checkpoint after every commit is like a 4th message already

vukolic
2016-05-20 18:38
well we should not halt right?

vukolic
2016-05-20 18:39
that's a DoS

vukolic
2016-05-20 18:39
I write non-det chaincode and kill the blockchain

jyellick
2016-05-20 18:39
In the MVCC+Postimage world, it should not be possible

jyellick
2016-05-20 18:39
Postimage is inherently deterministic

vukolic
2016-05-20 18:39
ok, so MVCC+postimage solves the thing and if you look at the pattern it is exactly like sieve

vukolic
2016-05-20 18:39
w/o leader

vukolic
2016-05-20 18:39
fact it does not have a leader

vukolic
2016-05-20 18:40
makes it more susceptible to concurrency clashes

vukolic
2016-05-20 18:40
but otherwise the pattern is the same

vukolic
2016-05-20 18:40
as we will have this in v2

vukolic
2016-05-20 18:40
I think we should do 0 in v1

jyellick
2016-05-20 18:40
Well, I would say the difference is that in Sieve, you do not know the output going in, you have to agree on the output

jyellick
2016-05-20 18:40
In the MVCC+Postimage you only have to agree whether an output is correct or not.

vukolic
2016-05-20 18:40
it is a minor change to have the same as in MVCC+postimage

vukolic
2016-05-20 18:40
leader could execute and propose a hash

vukolic
2016-05-20 18:41
and replicas would not execute themselves but confirm the leader or not

vukolic
2016-05-20 18:41
at some point I told chet we could easily have postimage in Sieve

vukolic
2016-05-20 18:41
but

jyellick
2016-05-20 18:41
Right, that is essentially the endorsers requiring f+1 policy

vukolic
2016-05-20 18:41
exactly

vukolic
2016-05-20 18:41
but

vukolic
2016-05-20 18:41
MVCC is where I clashed with Chet - because of the leader vs leader-less approach

vukolic
2016-05-20 18:41
and then

vukolic
2016-05-20 18:42
MVCC is simpler

vukolic
2016-05-20 18:42
so

vukolic
2016-05-20 18:42
let's try it

vukolic
2016-05-20 18:42
(Chet did say that this - simplicity - is the main reason MVCC is superior in his view)

vukolic
2016-05-20 18:42
and I can concur with that

vukolic
2016-05-20 18:43
if we end up with concurrency clashes all over the place - we sit and rethink the leader-based (or multi-leader) design

jyellick
2016-05-20 18:44
So, yes, you basically go leaderless, so that you assume your endorser is non-byzantine and you fix a lot of problems.

vukolic
2016-05-20 18:44
leaderless is the issue for concurrent tx changing the same objects

vukolic
2016-05-20 18:44
in UTXO this is largely a non issue

vukolic
2016-05-20 18:44
but we have a key-value store

vukolic
2016-05-20 18:44
will depend on the granularity of the data model - how often do we have concurrency issues

jyellick
2016-05-20 18:45
But getting back to the verifying the output side. It seems obvious to me, that if we want the system to scale, it's impractical to have clients connect to f+1 peers to perform a strong read. So it makes a lot of sense to me, to simply sign checkpoints, and then broadcast bundles of them, as implicit strong reads.

vukolic
2016-05-20 18:45
in principle this does not solve the issue

vukolic
2016-05-20 18:45
you can sign whatever

vukolic
2016-05-20 18:45
I as a Byzantine replica can serve to my clients stale reads

vukolic
2016-05-20 18:45
you need to go to more to be sure

jyellick
2016-05-20 18:46
So, this is two different problems to me.

jyellick
2016-05-20 18:46
One is, is the data that I'm reading definitely correct, at the version I'm being sent it.

jyellick
2016-05-20 18:46
IE, are blocks up through n correct.

vukolic
2016-05-20 18:46
that is signature-fixable yes

jyellick
2016-05-20 18:46
The other half, is "is it current", and the simple answer is, I don't think that's answerable. Period.

jyellick
2016-05-20 18:47
You could ask "is it current as of time XXXX" and maybe you can answer that.

jyellick
2016-05-20 18:47
But it's an asynchronous system, nothing's atomic, even at its most basic level; by the time the reply comes over the wire, it could no longer be current.

vukolic
2016-05-20 18:47
sure

vukolic
2016-05-20 18:48
anyway - great chatting here

vukolic
2016-05-20 18:48
seems we also made progress

vukolic
2016-05-20 18:48
need to take off

jyellick
2016-05-20 18:48
Okay, would love to continue this conversation at some point.

vukolic
2016-05-20 18:48
sure

sheehan
2016-05-20 18:49
@vukolic: "how often do we have concurrency issues” yes, that is the question that we struggle with

sheehan
2016-05-20 18:49
if choosing mvcc

jyellick
2016-05-20 18:50
I know we've frequently talked about adding a trusted time service. Assuming we have periodic checkpoints, and some way to sync on time, could we not include a timestamp in the signed checkpoint, to give a guarantee on "current as of time XXXX"?

vukolic
2016-05-20 18:58
time...


vukolic
2016-05-20 19:02
@sheehan: we will need to program with this in mind - it makes our life simpler as fabric developers - but whoever programs the chaincodes and defines objects will have a tougher job

vukolic
2016-05-20 19:06
but one should not implement chaincode for presidential election that says "if vote='trump' then trump:=trump+1"

ghaskins
2016-05-21 03:08
@jyellick: I would argue it's actually three problems: you mention "correct" and "current"… quorum signatures solve the first, and, to your point, "current" is difficult to prove… however, there is a middle state, and that is whether anything has been omitted outside of reasonable asynchronous issues. Let's call the third one "omission-detection". I would argue that 1) strong reads are a solution to the omission-detection problem and that 2) "current" isn't really an issue we need to worry about because this is solved a different way (e.g. transaction confirmation).

ghaskins
2016-05-21 03:10
To put it another way, I don’t necessarily care if I am “current” but I do want to figure out if information is being withheld from me (up to the limits of byzantine tolerance of the network) and I do want to monitor whether transactions confirm in a reasonable timeframe.

ghaskins
2016-05-21 03:14
@vukolic: I am not following your argument about variable increment within chaincode. Could you elaborate?

jyellick
2016-05-21 21:44
@ghaskins: I think what I'm driving at would be, 'Wouldn't it be nice, if the network periodically emitted a "this is my block height, and its hash, and this is the time", in a way that you could trust (say, as a bundle of signed messages). Because the time is always advancing, and you know you should receive an advertisement at least once every n seconds, you can be certain that you're not having the current state hidden from you. It seems like in many situations, this would be a nice substitution for a strong read, and it would also be much less work for the network.

jyellick
2016-05-21 21:46
Then, all the clients (NVPs) need, is a copy of the blockchain, and a 'recent' chain state attestation, and they could have 100% confidence in their reads (acknowledging the fact that sure, the state may have changed beyond time t, but as you say, this is difficult to prove even with a strong read)

chetsky
2016-05-21 21:53
@jyellick: I think if you look at "time, crontab", you'll find that combining that with the standard way that a client can read tran-outcome from 2f+1 clients, is enough. At least, at first blush.

jyellick
2016-05-21 21:57
@chetsky: There's been considerable interest in signing blocks. Now, this is somewhat counter to the PBFT idea, because it tries to eliminate having to constantly be signing messages (which is a bottleneck). Further, there's also been considerable desire to 'consent on output', which brought us to the idea of switching block creation from PBFT 'COMMIT's to PBFT CHECKPOINTS.

jyellick
2016-05-21 21:58
So, if we're signing checkpoints, and we're emitting those periodically anyway, they seem like a convenient mechanism to 'push' strong reads

jyellick
2016-05-21 22:00
(Especially since if block creation is at checkpoint, we must be manufacturing these messages already, so, there's limited additional overhead)

chetsky
2016-05-21 22:00
@jyellick of what interest is a strong read, other than to ensure you're getting up-to-date info? If one is truly worried that one's bank is .... lying to one .... truly, there are more effective solutions.

chetsky
2016-05-21 22:00
and if it's merely to ensure your data is up-to-date, watching the clock trans is enough

chetsky
2016-05-21 22:01
heck, as I think about it, why bother watching 2f+1 peers? just listen to one peer for clock trans

chetsky
2016-05-21 22:01
People who think they won't get their faces ripped off, if they run a full node and validate all trans, are delusional, after all

ghaskins
2016-05-21 22:03
@jyellick: I would agree with that

ghaskins
2016-05-21 22:04
“strong read” as it has been discussed might be more relevant on the query side

jyellick
2016-05-21 22:04
I definitely need to catch up to speed on "time, crontab", I think I get the gist of it, and can certainly see how that could eliminate the need for something like timestamped checkpoints.

jyellick
2016-05-21 22:05
Maybe it's just a use case that doesn't exist, but, I hear people asking "How can I validate the correctness of my copy of the blockchain, as an NVP, without having to query the network"?

jyellick
2016-05-21 22:06
(And clearly, to verify current-ness, you need to contact the network, but not correctness)

chetsky
2016-05-21 22:06
@ghaskins @jyellick Guys, y'know, I think there's a far graver issue: from my understanding (and talking with Jason, Marko, others in the BFT community) the definition of "correct" in BFT, is -incompatible- with ANY definition of fault-tolerant in the fault-tolerance literature.

chetsky
2016-05-21 22:07
I think, absent a serious and careful investigation (use hot lights, rubber hoses, stress positions; don't stint), it's not at all clear that PBFT actually is a suitable protocol.

chetsky
2016-05-21 22:08
People who aren't banks, and ask "how can I verify the correctness of my copy ..." .... <sigh> (oh, can I borrow a few thousand dollars for a few days?)

jyellick
2016-05-21 22:10
I completely agree that PBFT as it comes out of the box, and the desirable behavior for a fault tolerant distributed system are rather different. I'm not convinced it's not solvable, but certainly, there seem to have been some disconnects between people's expectations of BFT.

chetsky
2016-05-21 22:12
re: hot lights, you should not accept "workarounds". And whatever solution, must work in the face of maximum traffic (hence, must come with built-in flow-control)

chetsky
2016-05-21 22:12
b/c having floor(1/3) of your nodes lag behind on the busiest day of the year, WILL result in lawsuits

jyellick
2016-05-21 22:15
@chetsky: I agree, we need a much stronger clear long term architecture to handle this, and hopefully @vukolic can help us achieve this. But, in the interest of walking before running, ensuring that every non-byzantine node can catch up and participate under normal load (not DOS level busyness) is a good first target.

chetsky
2016-05-21 22:16
:wink:

simon
2016-05-23 08:12
chetsky: if 1/3 of your nodes can be byzantine, how can you avoid leaving 1/3 of your nodes to their own devices and ignore their slowness? They may try to catch up, but clearly they're acting (relatively) byzantine, because they're slower than the rest.

vukolic
2016-05-23 14:11
@ghaskins roughly, MVCCs mentioned above imply that out of two concurrent tx modifying a single object at most one may go through

vukolic
2016-05-23 14:12
so implementing that counter like in that chaincode example would clash left and right

ghaskins
2016-05-23 14:12
i see, you are referring to avoiding hotspots

ghaskins
2016-05-23 14:12
got it, makes sense

vukolic
2016-05-23 14:13
in UTXO data model this is largely a non-issue - as objects (coins) are not supposed to be modified concurrently

chetsky
2016-05-23 14:13
@simon absolutely, faulty nodes get no guarantees. But if all nodes are fault-free, and the network is fault-free, then the protocol should not -induce- faults, should not -induce- instability

vukolic
2016-05-23 14:13
as we move to key value store we may need to take care how we program chaincodes with MVCCs
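
The "counter chaincode clashes left and right" point can be made concrete with a toy MVCC validator in Go (a sketch of the general MVCC pattern, not the fabric data model): each transaction records the version of every key it read, and commits only if those reads are still current.

```go
package main

import "fmt"

// tx carries an MVCC read set (key -> version read) and a write set.
type tx struct {
	reads  map[string]uint64
	writes map[string]string
}

// store tracks the committed version and value of every key.
type store struct {
	versions map[string]uint64
	values   map[string]string
}

// apply commits a transaction only if all of its reads are still at
// the current committed version; a stale read is an MVCC conflict.
func (s *store) apply(t tx) bool {
	for k, v := range t.reads {
		if s.versions[k] != v {
			return false // stale read: reject
		}
	}
	for k, val := range t.writes {
		s.values[k] = val
		s.versions[k]++
	}
	return true
}

func main() {
	s := &store{versions: map[string]uint64{"trump": 0}, values: map[string]string{"trump": "0"}}
	// Two concurrent "trump := trump + 1" transactions both read version 0.
	t1 := tx{reads: map[string]uint64{"trump": 0}, writes: map[string]string{"trump": "1"}}
	t2 := tx{reads: map[string]uint64{"trump": 0}, writes: map[string]string{"trump": "1"}}
	fmt.Println(s.apply(t1)) // first increment commits
	fmt.Println(s.apply(t2)) // second one clashes and is rejected
}
```

This is exactly why a hotspot key like a shared vote counter performs badly under MVCC, while UTXO-style objects, which are not modified concurrently, largely avoid the problem.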

simon
2016-05-23 14:14
well should

simon
2016-05-23 14:14
but how do you prevent this?

chetsky
2016-05-23 14:15
I don't know enough about extant BFT protocols to be able to answer that. I do know that as a -systems- guy, I wouldn't start down the road of building a -system- without having a protocol that had the properties I outline above.

chetsky
2016-05-23 14:16
In short, I should be able to operate it at maximum throughput, at overload of ingress requests, and as long as no nodes or network hops are faulty, the system should not itself induce faults.

chetsky
2016-05-23 14:16
that's what flow control is for, after all.

simon
2016-05-23 14:16
yes

simon
2016-05-23 14:17
how do you differentiate between a faulty node and a slightly slow node?

simon
2016-05-23 14:17
to determine your flow control boundaries/metrics

chetsky
2016-05-23 14:17
I think we know the difference in practice, eh?

simon
2016-05-23 14:17
no?

simon
2016-05-23 14:18
not talking about a crashed node

chetsky
2016-05-23 14:18
agreed. all I'm saying is, you -do- know what's the difference between faulty and "slightly slow".

simon
2016-05-23 14:18
but just faulty - e.g. bad cable, network is slow

chetsky
2016-05-23 14:18
slightly slow nodes don't exceed their timeouts

simon
2016-05-23 14:18
ah! timeouts

simon
2016-05-23 14:19
what timeout would you set?

simon
2016-05-23 14:19
static or determine the timeout dynamically

simon
2016-05-23 14:19
not how many seconds - of course that depends

vukolic
2016-05-23 14:20
well - definitely deployment dependent, as a deployment on LSEG's floor and one over WAN are not going to be the same

vukolic
2016-05-23 14:20
whether dynamic this is yet another issue

vukolic
2016-05-23 14:21
(I am talking about timeouts obviously)

simon
2016-05-23 14:21
yes

chetsky
2016-05-23 14:23
in NO protocol, is it acceptable for one of the conditions to be "lower your input rate"

chetsky
2016-05-23 14:23
i can tell you a funny, funny story about a certain enterprise app-server product from 1998, in that regard.

chetsky
2016-05-23 14:24
seemed, it had some bugs in its ingress-request-processing code.

chetsky
2016-05-23 14:24
with high likelihood, with >60 concurrent requests, it would crash.

chetsky
2016-05-23 14:24
so when it shipped, it was with that proviso to customers.

chetsky
2016-05-23 14:25
now, since nobody knew -why- it crashed, that meant that it could happen at <60 concurrent reqs, and certainly nothing prevented load from going above that (per node)

chetsky
2016-05-23 14:25
after all, there are load-spikes in the real world.

chetsky
2016-05-23 14:26
suppose you have such a system, with N nodes, and you lose a node. The # of reqs/node has just increased. So you get another crash. And then another, etc.

chetsky
2016-05-23 14:26
Flow control is about preventing self-induced instability.

simon
2016-05-23 14:39
yes, sure

simon
2016-05-23 14:40
but byzantine flow control would use the slowest F replicas as the performance-limit gauge

simon
2016-05-23 14:40
because those F replicas may be deliberately slow

muralisr
2016-05-23 14:43
sorry for jumping in (and being fairly clueless, please forgive if dumb q. :slightly_smiling_face: ). Is it possible to summarize the discussion so I can catchup ?

simon
2016-05-23 14:43
hi muralisr

muralisr
2016-05-23 14:43
I can take it offline if you like

jyellick
2016-05-23 14:44
I think it can be summarized quickly here (might be nice for others too)

simon
2016-05-23 14:44
we're talking about the fact that plain PBFT will leave F nodes behind

simon
2016-05-23 14:44
i.e. the fastest 2F+1 nodes make progress, and the remaining F are lagging behind

simon
2016-05-23 14:44
and the protocol is not "waiting" for them

jyellick
2016-05-23 14:45
To put it another way, PBFT is designed not to allow f byzantine replicas to negatively impact the network. So, even when f nodes are simply 'a little slow', the network ends up leaving them behind, because it can't differentiate between "doing their best but slow" and "trying to slow the network down".

vukolic
2016-05-23 14:47
what may work there is aggressive-to-moderate - but not conservative - timeouts - to wait for all replicas to catch up most of the time

vukolic
2016-05-23 14:47
as in: wait for 2f+1 and expiry of a timer

vukolic
2016-05-23 14:48
where the timer is aggressive/moderate

jyellick
2016-05-23 14:48
@simon: @vukolic @chetsky. It seems like adaptive timeouts (we may have even discussed this in the context of XFT) could make sense. Say, any node has to be no slower than 80% of the speed of the 2f+1th fastest node. Otherwise we consider it byzantine and move along.

vukolic
2016-05-23 14:48
sth like that

simon
2016-05-23 14:48
so byzantine nodes can only slow down the network by 20%

jyellick
2016-05-23 14:49
Exactly. Obviously you could make it configurable, but then the negative impact would be bounded.
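The "80% of the 2f+1th fastest node" rule above could be derived per round from observed reply latencies. A minimal sketch; `adaptive_timeout` and all its parameters are hypothetical names for illustration, not Fabric code:

```python
def adaptive_timeout(response_times, f, slack=0.8):
    """Derive a per-round timeout from observed replica reply latencies.

    response_times: latencies (seconds) of the replies received so far.
    f: number of tolerated byzantine replicas (n = 3f + 1).
    slack: a node must be at least this fraction as fast as the
           (2f+1)-th fastest replica, or it is treated as byzantine.
    Illustrative sketch only.
    """
    quorum = 2 * f + 1
    fastest = sorted(response_times)[:quorum]
    # The (2f+1)-th fastest reply sets the baseline; the timeout
    # extends it by 1/slack (80% speed => 1.25x the baseline), so a
    # byzantine node can slow the network down by at most 20%.
    return fastest[-1] / slack
```

With `slack` configurable, the negative impact of deliberately slow replicas is bounded, as discussed above.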

simon
2016-05-23 14:50
i think it would be interesting to analyze the equilibrium between waiting for slower nodes and having to do state transfer

vukolic
2016-05-23 14:50
@simon write a paper! :slightly_smiling_face:

simon
2016-05-23 14:50
haha

simon
2016-05-23 14:50
i'm having trouble removing all the bugs i keep adding

jyellick
2016-05-23 14:50
From a business perspective, I'm not sure it's quite that simple. "Doing state transfer" isn't an acceptable alternative to participating in the network.

simon
2016-05-23 14:51
why not?

jyellick
2016-05-23 14:52
Maybe @chetsky can be more articulate about it, but if the f slowest nodes never execute a transaction but just get a slightly laggy state transferred version of the chain, that won't fly.

jyellick
2016-05-23 14:52
(Presumably, because they would be losing a business advantage by knowing the results at a later time, I'd think)

simon
2016-05-23 14:53
why not?

simon
2016-05-23 14:53
well, then they should use faster machines?

jyellick
2016-05-23 14:53
But then you get into an arms race, and you still leave the f slowest behind.

simon
2016-05-23 14:53
yes

simon
2016-05-23 14:53
darwinistic computing

ghaskins
2016-05-23 14:53
So I can play this back: IIUC in the perhaps naive approach, we assume that the network will operate at approximately the speed of the fastest 2f+1 nodes, leaving f nodes to potentially fall behind…but falling behind has its own cost in that lagging nodes must enter a state transfer protocol which is more expensive than steady state, perpetuating the load on the network

simon
2016-05-23 14:54
yes, that's what i suggested above

vukolic
2016-05-23 14:54
guys - we need to work here with well specified requirements - until then this is just the guesswork of what will businesses require

ghaskins
2016-05-23 14:54
thus impacting the speed of the presumably fastest 2f+1 because then they are busy catching the others up

jyellick
2016-05-23 14:55
It seems like a clear requirement, that if all nodes are trying to participate in a non-byzantine way, they should all be able to.

ghaskins
2016-05-23 14:55
and the proposal is to introduce a mitigating strategy to flow control the network in general to reduce state transfer pressure?

simon
2016-05-23 14:55
but what does "participate" mean?

jyellick
2016-05-23 14:55
PBFT as designed, leaves the f slowest nodes never executing transactions, always trying to catch up, whenever the network is under any sort of serious load.

simon
2016-05-23 14:55
ghaskins: no, people don't like it that until state transfer triggers, nodes are "behind"

chetsky
2016-05-23 14:56
guys, can I suggest you look at (for instance) the Ensemble system

jyellick
2016-05-23 14:56
@ghaskins: It is not about preventing state transfer, it's about allowing all nodes to be 'current' in terms of participating in the ordering etc.

chetsky
2016-05-23 14:56
real world distributed systems designed for massive data-flows all include flow-control

chetsky
2016-05-23 14:56
it isn't a 'mitigation"

simon
2016-05-23 14:56
but they can participate in ordering

simon
2016-05-23 14:56
they just don't execute

chetsky
2016-05-23 14:56
it's a core part of what they do

vukolic
2016-05-23 14:56
@jyellick this is not sufficient, as network faults may make them appear to the rest of the network as faulty

ghaskins
2016-05-23 14:57
@chetsky: what i am driving at is: isnt the flow control kind of already there?

jyellick
2016-05-23 14:57
@vukolic: I agree, at some point nodes need to be left behind, but as a normal case high load operating principle "we leave the f slowest behind", that is a problem.

vukolic
2016-05-23 14:57
that we may try to solve with that aggressive/moderate (and perhaps dynamic) timeout

ghaskins
2016-05-23 14:57
e.g. theres nothing you can do about the slower nodes per se…the network will run at the speed of the fastest 2f+1…that is a form of flow control right there

ghaskins
2016-05-23 14:58
at least at the client txn confirmation rate level, not necessarily the consensus protocol level

ghaskins
2016-05-23 14:59
personally, I dont see the f-nodes lagging as a problem…thats the nature of being byzantine resistant

ghaskins
2016-05-23 14:59
i think its irreducible, actually

vukolic
2016-05-23 14:59
@ghaskins - nice to hear this

jyellick
2016-05-23 14:59
@ghaskins: It's great to get that sort of feedback, but I have gotten just the opposite from many (for instance @bcbrock)

ghaskins
2016-05-23 15:00
i think the moment you say “the network must be in lockstep” you immediately discard the notion of being byzantine resistant

vukolic
2016-05-23 15:00
we actually have some methods to deal with that - some being discussed here but others may be more invasive

vukolic
2016-05-23 15:00
(while staying byzantine resilient of course)

vukolic
2016-05-23 15:01
anyway @jyellick @simon seems we are going for: 1) periodic leader rotation and 2) wait for f slowest nodes but not too much

ghaskins
2016-05-23 15:03
so back to my comment about mitigating strategy, unless there is some quantifiable negative impact to the 2f+1 when f are slow (such as increased state-transfer pressure), let them be slow

vukolic
2016-05-23 15:04
@ghaskins: this was my initial answer but apparently users may have objections to this

vukolic
2016-05-23 15:04
so 1) and 2) would adress this to a large extent and would be configurable

vukolic
2016-05-23 15:04
so you can switch them off

ghaskins
2016-05-23 15:04
it strikes me as “cake and eat it too"

jyellick
2016-05-23 15:04
@ghaskins: As mentioned, this is great feedback, we'll make sure we can at least have this behavior via config

vukolic
2016-05-23 15:05
and let dragging replicas - well - drag...

jyellick
2016-05-23 15:05
(and it is what exists today)

ghaskins
2016-05-23 15:07
@vukolic: I would love to understand more about the ideas you mentioned above that can deal with this

ghaskins
2016-05-23 15:07
even if its just academic curiosity…because I am fine with the lagging


jyellick
2016-05-23 15:08
(For some of the strategy)

vukolic
2016-05-23 15:08
these are good pointers - will also recap here

ghaskins
2016-05-23 15:08
@jyellick: ty

vukolic
2016-05-23 15:09
for 1) periodic leader rotation: it addresses a lagging replica in the case when it is one of at most f replicas that complain about the leader and issue a view change. Per the PBFT protocol, such a replica does not participate in the main protocol until the leader is changed

vukolic
2016-05-23 15:09
so periodic leader rotation somehow helps unblock such replicas

vukolic
2016-05-23 15:10
https://github.com/hyperledger/fabric/issues/1120 contains more elaborate discussion - including a more invasive solution to premature lack of trust into leader (cf. SUSPECT messages mentioned there)

vukolic
2016-05-23 15:12
re 2) it is simple - instead of waiting for the 2f+1 fastest replicas (out of 3f+1), we would wait for the 2f+1 fastest AND an aggressively/moderately set timeout to expire before moving on

ghaskins
2016-05-23 15:12
I just skimmed the issue, and yes, this all makes sense

vukolic
2016-05-23 15:12
if a lagging replica cannot respond within that timeout it will lag - and let it lag
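The "wait for 2f+1 AND a timer" rule can be sketched as a small predicate. Hypothetical names, not the Fabric implementation:

```python
def can_advance(replies, f, timer_expired):
    """Decide whether to move to the next sequence number.

    Plain PBFT advances as soon as it has a 2f+1 quorum. The variant
    discussed here additionally waits for a moderately aggressive
    timer, giving the f slowest replicas a chance to keep up; if the
    timer fires first, the laggards are left to lag.
    """
    n = 3 * f + 1
    if replies >= n:  # everyone answered: nothing left to wait for
        return True
    return replies >= 2 * f + 1 and timer_expired
```

As noted below, this bounds latency by the timeout, so the mechanism would be switchable off by config.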

ghaskins
2016-05-23 15:12
without a SUSPECT protocol, the node may end up being permanently isolated

vukolic
2016-05-23 15:13
obviously this somehow limits latency to that timeout - so this is bad

vukolic
2016-05-23 15:13
but we will be able to switch the mechanism off

vukolic
2016-05-23 15:14
We avoid SUSPECTS at the moment as adding protocol msgs - is in a sense invasive

vukolic
2016-05-23 15:14
so it will be a last resort

ghaskins
2016-05-23 15:14
it strikes me that you could summarize this as the basic notion of introducing consensus to view-change itself

muralisr
2016-05-23 15:15
if a machine joins the network and is neither slow nor byzantine, will it catch up “pretty quickly” ?

jyellick
2016-05-23 15:15
Before or after the suggested implemented changes?

muralisr
2016-05-23 15:15
before

vukolic
2016-05-23 15:15
hopefully :slightly_smiling_face: - once we implement reconfiguration - that is

muralisr
2016-05-23 15:15
as is today

jyellick
2016-05-23 15:16
Recovery is driven by eavesdropping, so today, if there is no traffic, no recovery will occur.

jyellick
2016-05-23 15:16
Some of the proposed changes basically artificially generate traffic, so that recovery can occur.

muralisr
2016-05-23 15:17
if the answer is a “good machine” will likely catch up, then the effect of letting-things-be is just to be darwinistic ?

muralisr
2016-05-23 15:18
ie, go back to what @ghaskins was saying - don’t do anything, its part of the game

jyellick
2016-05-23 15:18
If the node was partitioned, and initiated a view change, its blockchain will slowly catchup, but it will not start participating in ordering/executing until a view change occurs.

jyellick
2016-05-23 15:18
(We propose periodic view changes to solve this)

muralisr
2016-05-23 15:18
ok

muralisr
2016-05-23 15:19
“node was partitioned” as in network partitioning ?

jyellick
2016-05-23 15:19
Think, ethernet cable got unplugged for a few minutes.

muralisr
2016-05-23 15:19
ok

muralisr
2016-05-23 15:19
thanks much!

jyellick
2016-05-23 15:20
You're welcome, anytime

bcbrock
2016-05-23 15:31
@ghaskins @jyellick The problem with allowing lagging nodes to lag permanently is that it obviates the need for this project. Why would a client invest in the infrastructure to add “her” node to the blockchain network, if her node is not going to be up-to-date? The database might as well become a centralized service then. “Strong reads” are another example of a solution that calls into question the need for the blockchain. I don’t want the network to wait for the slowest node, but I think all nodes need to make progress as fast as they can unless they are faulty.

ghaskins
2016-05-23 15:33
@bcbrock: from my perspective, its the opposite…you would want to use this project specifically _because_ its resistant to a slow node and a result can be validated with a strong-read.

bcbrock
2016-05-23 15:33
You can do that already

bcbrock
2016-05-23 15:34
with other distributed databases

ghaskins
2016-05-23 15:34
i have to step out for a dr appt, to be continued...

jyellick
2016-05-23 15:35
@bcbrock: Other distributed databases are not byzantine fault tolerant?

jyellick
2016-05-23 15:36
[I'm actually inclined to agree with you with respect to keeping up to date with the network, just trying to understand where everyone is coming from]

bcbrock
2016-05-23 15:57
@jyellick I’m coming from trying to understand how to explain the value proposition of joining a permissioned peer network as a peer. It seems that the current consensus system makes sense if the peer networks are not large, but composed of only a few independent players that everyone trusts, where most participants set up read-only peers that do strong reads from the trusted core. I don’t think this is how we currently explain the benefits of blockchain however, it’s all about everyone having all of the data.

muralisr
2016-05-23 15:58
@bcbrock: “ …. allowing lagging nodes to lag permanently …” - if that can, in general, happen only with byzantine nodes or slow machines., that is, it is not “typical” scenario, I’d think that’s also subsumed by the model ? ie, nodes that lag behind deserve to lag behind….

kostas
2016-05-23 15:59
@bcbrock: I would replace "everyone trusts" with "limited trust", and we make up for that lack of trust by checking for Byzantine faults

jyellick
2016-05-23 16:05
@muralisr: I think the difference here is that your machine doesn't have to be "slow" to be left behind. In general, the f slowest replicas will constantly be lagging, even if they are 99% as fast as the 2f+1st fastest replica, whenever the system is under high load.

muralisr
2016-05-23 16:06
ah true. understood…(thanks @jyellick !)

muralisr
2016-05-23 16:16
interesting, it's like this discussion. If it's slow enough, I can catch up :slightly_smiling_face: …. if that analogy holds, isn't it a matter of flow control ?

cca
2016-05-23 16:33
Interesting & good discussion here. My 2c: @chetsky - Ensemble would be the wrong place to look for a solution because there, if a node does not respond after a timeout, then it is faulty in the *worst* possible way that matters, because it only considers crashes. So if the slow node times out, it can be thrown out and we do not harm the system. Also, Ensemble will reconfigure the group for this. On the other hand, in the BFT model, a slow node shows a *mild* form of fault, because it could catch up later and fill in for an actually misbehaving node. So the suggested solution makes sense to me (use a moderate timeout, tuned to the progress rate of the others, with sth like the 20% idea). The inflow control must also be adjusted to what the system can handle. @ghaskins: Operating with much asynchrony and permitting the lagging nodes will create an issue with buffering: in theory we can pretend nodes perpetually try to resend to the slow nodes, but this implies unbounded buffers, either at the sender or in a communication channel like TCP. Neither has such unbounded memory in practice though. That's why the proposed solutions are needed (periodic view change, gossip, and so on).

ghaskins
2016-05-23 17:07
@bcbrock: back

ghaskins
2016-05-23 17:08
so, you were wondering what is the value proposition if you were to go through the trouble of installing a VP only to have it be summarily ignored by the network IIUC

ghaskins
2016-05-23 17:09
theres two parts to that answer

ghaskins
2016-05-23 17:10
the first part is the value proposition to _everyone else_…that is, someone bringing a subpar node/network to the cluster doesn’t take everyone else down with it

ghaskins
2016-05-23 17:14
the second is to recognize that “validation” is actually a multi faceted beast: one part is the part we often talk the most about…the notion of computing a signature for a given transaction juxtaposed against a specific world state….the second part is about validating the signatures of all the participating validating peers

ghaskins
2016-05-23 17:15
any VP has the ability to participate in the first part, but we only need a certain subset to achieve quorum….its a best effort contribution in the hope that your VP helps the network make forward progress

ghaskins
2016-05-23 17:16
its the second part that is actually important w.r.t. the value proposition of a potential participant…the ability to verify that everything looks kosher, up to the limits of the byzantine resistance of the network.

ghaskins
2016-05-23 17:18
A slow node only loses out in the ability to help achieve quorum, it doesn’t lose out in its ability to ascertain the legitimacy of the world state

ghaskins
2016-05-23 17:19
What I am trying to say is, liveness/real-time doesn't matter for the most important function; it can be done offline at any time in the future.

ghaskins
2016-05-23 17:23
If, on the other hand, we are saying that all things being equal, certain nodes may never catch up, we should probably try to address that so its more even

ghaskins
2016-05-23 17:27
but if someone introduces a particularly slow node and/or connection, i have absolutely no problem with the notion that it might never contribute a signature

ghaskins
2016-05-23 17:50
@cca: understand what you are getting at: What I am saying is that at least for certain use cases, there will be a natural flow control at a higher level

ghaskins
2016-05-23 17:51
for instance, if I am doing UTXOs, im probably not going to blast a chain of 100k successive spends of the same coin..rather I am going to do certain transactions and then block for confirmation

ghaskins
2016-05-23 17:52
if confirmation is slow, my spend requests slow down

ghaskins
2016-05-23 17:52
thats all I was getting at

jyellick
2016-05-23 17:59
@ghaskins: Do you see clients querying 'the network', or, 'their validating peer'?

ghaskins
2016-05-23 18:03
Thats a tough question….we didn’t have the notion of query() like the one that exists in HLF. Everything went through consensus, which solves some problems (and creates others). When I first joined OBC I didn’t like the notion of “trusting your NVP”, at least partly because I didn’t view it as “part of the client stack”. If you do view it that way (or consume something more explicit like the upcoming node sdk), it seemed that “strong read” was the only way to go. However, I think the current mechanism for query() is not really conducive to strong reads so I am not sure we have much of a choice

ghaskins
2016-05-23 18:05
we would either need to be able to specify the block height in the request (which may have implications for clients that ask for sufficiently old blocks), or to include block height in the response so that clients could tell when they get a stale answer (as opposed to a byzantine answer)

ghaskins
2016-05-23 18:06
I do like what you were getting at the other day with the notion that we can probably “push” some kind of synchronizing signal w.r.t. “current” rather than requiring remote end points to try to request it

ghaskins
2016-05-23 18:06
i think that is part of the story here.

jyellick
2016-05-23 18:07
Yes, there seem to be two camps to some extent. The "You can't possibly trust the result unless it's a strong read, which goes to the whole network" and then the other side of "So long as I know my copy of the blockchain is correct, there's no reason to go to the network"

kostas
2016-05-23 18:08
don't checkpoints play that role of the synchronizing signal?

jyellick
2016-05-23 18:08
That is one proposal

jyellick
2016-05-23 18:08
(And one I like)

ghaskins
2016-05-23 18:10
just realized my first statement above was incoherent (sorry, just had my eyes dilated and cant see very clearly right now, heh)

ghaskins
2016-05-23 18:10
what I meant was, if you can get to “i trust my stack”, then a weak read is ok

ghaskins
2016-05-23 18:11
but I do think that “trusting your stack” really means that you own an NVP-like representation of the chain locally

jyellick
2016-05-23 18:11
The sort of fundamental architecture question I have is whether clients get to talk to the whole network or not. As I've heard it described before, you have 4 companies, each running a validating peer, and of course, company A isn't going to let clients from company B query its validating peer, because of firewalls and generally because they do not want to pay to support their queries (maybe company B runs 100x the queries of company A).

ghaskins
2016-05-23 18:13
I have always envisioned they could, but i see your point

ghaskins
2016-05-23 18:13
let me think about it some more

jyellick
2016-05-23 18:13
And maybe it's fine that they can, but if they cannot, it changes the implementation considerably

ghaskins
2016-05-23 18:14
agreed

jyellick
2016-05-23 18:15
I think in the bitcoin world, they would say "It's totally unacceptable to have to go ask the network, I need to be able to trust queries against my node". Of course I think that's a little disingenuous, you still need information from the network, like the current block height, to actually trust your node, but I still understand their concern that doing strong reads is very much a different paradigm. It would be nice to support both.

kostas
2016-05-23 18:16
@ghaskins: as long as you have f+1 checkpoints for block X (which is a process that happens during PBFT), your local chain is solid up to that point, so a weak read is good. checkpoint messages keep coming periodically and harden your local chain up to block Y (Y > X), so eventually you get a longer chain you can trust. do you see any drawbacks to that?
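The f+1-checkpoint argument above is that at least one of the f+1 signers must be honest, so the local chain is trustworthy up to that block without a strong read. A minimal sketch; `trusted_height` and its input shape are illustrative, not the actual checkpoint message format:

```python
def trusted_height(checkpoint_votes, f):
    """Highest block height vouched for by at least f+1 distinct replicas.

    checkpoint_votes: dict mapping block height -> set of replica ids
    that sent a matching checkpoint for that height. With at most f
    byzantine replicas, any f+1 matching checkpoints include an honest
    one, so a local (weak) read up to that height is safe.
    Returns None if no height is sufficiently vouched for.
    """
    safe = [h for h, voters in checkpoint_votes.items() if len(voters) >= f + 1]
    return max(safe, default=None)
```

Periodic checkpoint messages would then keep extending the hardened prefix of the local chain, as described above.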

jyellick
2016-05-23 18:17
@kostas There is the censorship problem, but I think with the 'time, crontab' stuff @muralisr is working on, you would be able to detect that

kostas
2016-05-23 18:17
censorship from whom specifically?

jyellick
2016-05-23 18:18
A VP censoring updates to a client. Censorship would be the malicious form, but you could simply go with 'staleness' as a more general term.

kostas
2016-05-23 18:20
correct, but not an unfixable problem. in the end, unless you're partitioned (at which case you have bigger problems), you should be able to get those checkpoints.

ghaskins
2016-05-23 18:21
@kostas: when I first joined the project, I was bothered by that model because I saw the local NVP as part of “the network” and not part of “the client”. I think when we start to talk about having the NVP be part of the client figuratively (or literally via the nodesdk) then it becomes more practical. I think the important distinction is when looking at “the trust line”. On the trusted side of the trust line, single points of reference are “ok”. On the untrusted side, you have to “fan out” or “strong read”….now by strong read, I dont mean necessarily w.r.t. transactions/queries…but rather just the notion that multiple points of reference are considered….this would include consensus/ledger level comms

ghaskins
2016-05-23 18:21
bridging the trust line must consider the entire network state, IOW

kostas
2016-05-23 18:22
agreed

ghaskins
2016-05-23 18:23
lunch time, bbiab

vladimir.starostenkov
2016-05-23 20:25
has joined #fabric-consensus-dev

vukolic
2016-05-24 08:13
they might as well be - but this is not immediate/obvious

simon
2016-05-24 08:33
@muralisr: you are working on "time, crontab"? what's this?

muralisr
2016-05-24 11:27
@simon : more like “playing with”. A time service using consensus based on @chetsky’s idea (a while back)

simon
2016-05-24 11:32
ah

simon
2016-05-24 11:32
how does it work?

muralisr
2016-05-24 11:39
basically a service will create "update time" transactions periodically (at some granularity of seconds, say 1 second) on a system chaincode. If everyone consents - based on the proposed time being within accuracy limits of their local clock - that will serve as a global time. The time has to be coarse-grained (not msecs, for example)….
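The endorsement check each validator would run can be sketched in one line. `endorse_time_update` and its parameters are hypothetical names, not the actual system chaincode API:

```python
def endorse_time_update(proposed, local_now, tolerance):
    """A validator endorses a periodic 'update time' transaction only if
    the proposed coarse-grained timestamp is within `tolerance` seconds
    of its own clock; once 2f+1 validators endorse, the proposed value
    becomes the agreed global time. Illustrative sketch only."""
    return abs(proposed - local_now) <= tolerance
```

The tolerance is what makes the time deliberately coarse-grained: honest clocks that disagree by less than it still reach agreement.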

simon
2016-05-24 11:39
everyone?

muralisr
2016-05-24 11:39
the devil will be in the details :slightly_smiling_face: but it is an interesting approach

simon
2016-05-24 11:39
or 2f+1?

muralisr
2016-05-24 11:39
2f+1

simon
2016-05-24 11:39
how do you tie in the validation?

simon
2016-05-24 11:40
what happens if the service stops updating the time?

jamie.steiner
2016-05-24 11:40
Guardtime, my company, has a product that does exactly that (among other things)

jamie.steiner
2016-05-24 11:41
and can be cryptographically verified out of band

muralisr
2016-05-24 11:41
there could be copies of services

jamie.steiner
2016-05-24 11:42
if that is a requirement, I would suggest we may have a ready solution

simon
2016-05-24 11:42
interesting - how would that go into hyperledger?

simon
2016-05-24 11:43
muralisr: i'm working on periodic null requests right now

muralisr
2016-05-24 11:44
@simon: goroutine(s) built into the fabric for initiating transactions and a system chaincode

simon
2016-05-24 11:44
muralisr: that was directed to jamie.steiner

simon
2016-05-24 11:45
muralisr: i'm wondering - what about all consensus nodes attaching their idea of time to packets

simon
2016-05-24 11:45
(with signature)

muralisr
2016-05-24 11:46
@jamie.steiner: there's an issue that covers this broadly (@simon do you have the issue you created handy ?)

simon
2016-05-24 11:46
and then the leader compiles a list of times, which allows bounding the idea of time as "somewhere between this and that value"
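Bounding time "somewhere between this and that value" from the collected signed timestamps can be done by trimming extremes: with at most f byzantine signers, discarding the f smallest and f largest reported times leaves endpoints each vouched for by an honest replica. A sketch under those assumptions; the function name is illustrative:

```python
def time_bounds(signed_times, f):
    """Bound 'now' from per-replica timestamps, tolerating f liars.

    Discard the f smallest and f largest reported times; the extremes
    of what remains were each reported by at least one honest replica,
    so the real time lies in [lo, hi] up to honest clock skew.
    Requires at least 2f+1 reports. Illustrative sketch only.
    """
    if len(signed_times) < 2 * f + 1:
        raise ValueError("need at least 2f+1 timestamps")
    trimmed = sorted(signed_times)[f:len(signed_times) - f]
    return trimmed[0], trimmed[-1]
```

Even a byzantine replica reporting an absurd value (500 below) cannot push the bounds outside the honest range.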

jamie.steiner
2016-05-24 11:46
access is available via an http service

simon
2016-05-24 11:46
although it is probably better not to attach it to the consensus service directly

jamie.steiner
2016-05-24 11:47
we have a developer program which could be used to evaluate it: https://guardtime.com/blockchain-developers

jamie.steiner
2016-05-24 11:47
happy to get some of our developers to assist - @ristoalas would be a good resource

simon
2016-05-24 11:47
jamie.steiner: i don't think that would work - the time keeping component needs to be part of hyperledger

ristoalas
2016-05-24 11:48
has joined #fabric-consensus-dev

simon
2016-05-24 11:48
muralisr: i suggested that peers compete in proposing the next time

muralisr
2016-05-24 11:49
@simon yes. That’s also an option… the time service would be on every peer

jamie.steiner
2016-05-24 11:49
I would have thought time is the kind of thing an external oracle would be useful for

simon
2016-05-24 11:49
but then, do we want to record thousands of transactions a day that just update the time?

muralisr
2016-05-24 11:49
exactly

simon
2016-05-24 11:49
jamie.steiner: i don't think we should build a generic platform that is tied to one company's service

muralisr
2016-05-24 11:49
and in the end we also need this to work closely with crontab

simon
2016-05-24 11:50
what if your company disappears - suddenly all hyperledger blockchains stop working

muralisr
2016-05-24 11:50
loosely couple approach via a system chaincode helps us at least play with this

simon
2016-05-24 11:50
muralisr: what is crontab in this scenario?

muralisr
2016-05-24 11:51
basically allows users to specify "run tx-A at this time"

muralisr
2016-05-24 11:51
think timer-wheel going over transactions to run

muralisr
2016-05-24 11:52
triggers
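The timer-wheel-over-transactions idea can be sketched with a priority queue on the local peer. `Crontab` and its methods are illustrative names, not the fabric implementation (and, as noted later in the chat, a real version would need a persistent component to survive a peer crash):

```python
import heapq


class Crontab:
    """Minimal sketch of the discussed crontab service: users register
    'submit transaction tx at (around) time t'; the local peer that
    holds the entry pops everything due whenever its clock ticks."""

    def __init__(self):
        self._heap = []  # (fire_at, tx) pairs, ordered by fire time

    def schedule(self, fire_at, tx):
        heapq.heappush(self._heap, (fire_at, tx))

    def due(self, now):
        """Return every transaction whose fire time has passed, i.e.
        entries triggered 'after time X, within tolerance'."""
        fired = []
        while self._heap and self._heap[0][0] <= now:
            fired.append(heapq.heappop(self._heap)[1])
        return fired
```

This matches the "after some specific time, within tolerance" semantics discussed below: a transaction fires on the first tick at or after its scheduled time, not at an exact instant.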

simon
2016-05-24 11:52
and who triggers the transaction?

simon
2016-05-24 11:52
which peer

muralisr
2016-05-24 11:52
same thing as today. The peer to which the transaction is submitted will hold the crontab entry

jamie.steiner
2016-05-24 11:52
as far as I am aware, a number of functions are designed to be pluggable, based on the deployment and the requirements. It seems like one option could be to offer an external oracle for time. Even the base consensus algorithm is more-or-less pluggable, right?

simon
2016-05-24 11:53
muralisr: so the local peer

muralisr
2016-05-24 11:53
correct

jamie.steiner
2016-05-24 11:53
im not trying to create a dependency, just offering our expertise.

simon
2016-05-24 11:53
jamie.steiner: the aspiration is to make it pluggable

simon
2016-05-24 11:54
jamie.steiner: the difficulty is not the external oracle, but trusting its data and the way it has been introduced into the system

simon
2016-05-24 11:54
jamie.steiner: otherwise we could just use ntp

simon
2016-05-24 11:54
muralisr: but that notion of time would be independent of the chaincode's notion of time

simon
2016-05-24 11:55
muralisr: what happens if that peer goes down or suffers network outage, etc?

muralisr
2016-05-24 11:56
chaincode would have to have an API to on the shim to get the time

jamie.steiner
2016-05-24 11:56
it's not similar to using ntp - in this case it would function as a TTS that does not require trust in a root certificate.

jamie.steiner
2016-05-24 11:57
happy to discuss offline, if you are interested in evaluating the approach.

simon
2016-05-24 11:58
why offline?

simon
2016-05-24 11:58
this is perfect

simon
2016-05-24 11:58
the problem is that even if time comes from one oracle, how do you know that you can trust the entity that took the time from the oracle, and gave you the right value?

muralisr
2016-05-24 11:58
and the time service would be across validators ( skewed on a random sleep )

jamie.steiner
2016-05-24 11:58
here is fine as well. will have to come back after a while though, i have a call

simon
2016-05-24 11:59
it could have delayed the message for an arbitrary time

jamie.steiner
2016-05-24 11:59
a particular message is submitted to our service, and gets a time associated with it that is universal.

muralisr
2016-05-24 12:00
in the end, the idea is to have something loosely coupled and easy enough to work with for everyone

muralisr
2016-05-24 12:00
built upon and using fabric’s mechanisms such as consensus

simon
2016-05-24 12:00
i'm interested in the specific implementation

simon
2016-05-24 12:01
so the crontab is just a "submit this transaction after time X"

simon
2016-05-24 12:01
that's the equivalent to the bitcoin mempool, in a way

muralisr
2016-05-24 12:01
“around time X” is more accurate I think

simon
2016-05-24 12:01
what does "around" mean?

muralisr
2016-05-24 12:02
won't be at 11.02.22.536 exactly … but between 11.02.22 and 11.02.24

simon
2016-05-24 12:02
so after 11.02.22

muralisr
2016-05-24 12:03
within tolerance

simon
2016-05-24 12:03
that's too vague for me

muralisr
2016-05-24 12:05
really? well, the tolerance is just so we don't build the system to be too fine-grained. The idea being that since the execution of transactions cannot be accurately timed anyway, why make it a requirement that it be initiated at an exact time

muralisr
2016-05-24 12:06
the crontab has to do work and the tolerance just gives us the room to do it

simon
2016-05-24 12:08
i didn't say initiated at an exact time

simon
2016-05-24 12:08
i said after some specific time

muralisr
2016-05-24 12:09
right

simon
2016-05-24 12:09
also do you want to trigger the transaction based on real world time, or based on agreed-upon chaincode time?

simon
2016-05-24 12:09
and why

muralisr
2016-05-24 12:14
you mean the “crontab” transaction ?

muralisr
2016-05-24 12:15
that’s just a service the system provides so users can initiate a transaction based on system time

simon
2016-05-24 12:16
just local time

simon
2016-05-24 12:16
ok

muralisr
2016-05-24 12:20
again this is just something to try out as the implementation is not too hard

simon
2016-05-24 12:29
try out is bad

simon
2016-05-24 12:29
because it doesn't consider all issues that can arise

muralisr
2016-05-24 12:31
obviously this is not for merge into mainline

simon
2016-05-24 12:32
sure, but what i'm saying is that this needs a requirements doc, etc.

simon
2016-05-24 12:32
e.g. what happens if you submit a request, and the peer crashes and restarts?

muralisr
2016-05-24 12:36
correct. There's a persistent component to crontab.

muralisr
2016-05-24 12:43
as for a requirements doc, definitely. The implementation is to test the mechanics fairly quickly and fail fast.

simon
2016-05-24 13:49
problem with consensus and security is that you don't fail fast :slightly_smiling_face:

muralisr
2016-05-24 13:50
haha..you could be right

jamie.steiner
2016-05-24 14:32
@simon: "the problem is that even if time comes from one oracle, how do you know that you can trust the entity that took the time from the oracle, and gave you the right value?" can you elaborate? is the problem that you cant trust the accuracy of the time, or that you cant trust that the entity took some action at that particular moment?

simon
2016-05-24 14:32
either

jamie.steiner
2016-05-24 14:39
I realize that our solution is specific to my company's service, but I can explain how we solve that problem. I understand you may feel there is a reason to make a time stamping component that is intrinsic to the hyperledger stack. If, however, there is a possibility to use an external source as an oracle for time, our service would serve well in that role. Specifically, our blockchain creates one block every second, verifiable through public observation that there are 3600 blocks every hour, etc. Any piece of data or action that can be represented as data can be signed, and its existence at that point in time can be proven. The trust anchor is widely witnessed evidence that is periodically published in newspapers. The time can be backed out from that.

simon
2016-05-24 14:41
but only for the past

simon
2016-05-24 14:41
not for the current time

simon
2016-05-24 14:42
i can not be sure that the time you just published is actually timely - you might have delayed publication by some time

simon
2016-05-24 14:42
but, for a moment, let's assume that you act honestly

simon
2016-05-24 14:43
whatever node that introduces your timestamp into the hyperledger blockchain might just delay your honest timestamp arbitrarily

jamie.steiner
2016-05-24 14:44
if you are assuming that our service acts honestly, and I present you a signature that relates to a particular time - you must agree that this time has passed.

simon
2016-05-24 14:45
no

simon
2016-05-24 14:45
that *at least* this time has passed

simon
2016-05-24 14:45
but it could be a day later in fact

simon
2016-05-24 14:45
and it is a stale piece of information

jamie.steiner
2016-05-24 14:47
sure, but if other timestamps from moments after that are known, then the stale nature is obvious. it seems trivial to sign a piece of data every second and avoid this.

simon
2016-05-24 14:49
but all of these signed timestamps could be delayed by a day

simon
2016-05-24 14:49
nobody can tell the difference

jamie.steiner
2016-05-24 14:54
anyone who has access to the service can get a fresh timestamp and see that it returns in a second. I'm not sure i understand how it can be delayed.

simon
2016-05-24 14:56
chaincode cannot access the network

simon
2016-05-24 14:56
so this needs to be integrated into the hyperledger code

simon
2016-05-24 14:57
but different replicas will receive the request at different times

simon
2016-05-24 14:57
at that point, the replicas might as well just look at their local clock

jamie.steiner
2016-05-24 14:59
hmm, I see. so the real issue is that the notion of time has to be local to the chaincode, not the node that is executing the chaincode? I dont agree that the local clock is as good - there is still the notion of an outside, impartial proof of what time is. It is separate from, and independent from the local clock in a useful way.

bcbrock
2016-05-24 15:00
@jyellick Jason, regarding PR #1557, I apologize, I didn’t realize the extent of the problem. I had assumed that the choice was between “stale” and “up-to-date” values. Of course the query should never return garbage.

simon
2016-05-24 15:00
jamie.steiner: but ntp is also impartial

simon
2016-05-24 15:01
bcbrock: that's #1091

jyellick
2016-05-24 15:01
@bcbrock: Not a problem, it's great to have interest to get these things fleshed out, I completely agree we need a better way to communicate this through the API

simon
2016-05-24 15:01
we totally have to

simon
2016-05-24 15:01
it's a long standing problem

jamie.steiner
2016-05-24 15:01
its accuracy is not independently provable, and an attestation of time by ntp cannot be transferred or verified by a third party

simon
2016-05-24 15:02
ntp's accuracy is as provable as your accuracy

simon
2016-05-24 15:03
you say "look at our past performance, we've been working correctly"

jamie.steiner
2016-05-24 15:03
I disagree. you can choose to trust it locally, if you choose, but you cannot later explain to a third party why you chose that. an ntp timestamp is just a piece of data

simon
2016-05-24 15:03
but that doesn't mean that the current timestamp is not wrong

simon
2016-05-24 15:03
that is correct

simon
2016-05-24 15:04
but for a byzantine fault tolerant network, trusting one external oracle is silly anyways

jamie.steiner
2016-05-24 15:05
inevitably, it has to connect to external events.

jamie.steiner
2016-05-24 15:05
and oracles will be required.

simon
2016-05-24 15:06
well if i trust an oracle

jamie.steiner
2016-05-24 15:06
if not for time, then for what LIBOR is, or whether company XYZ defaulted.

simon
2016-05-24 15:06
why don't i just run all the code there

simon
2016-05-24 15:06
then i can skip the whole byzantine threat model

jamie.steiner
2016-05-24 15:07
I think the problem of trusting code execution is separate from the problem of how a ledger connects to external events.

simon
2016-05-24 15:08
sure

simon
2016-05-24 15:08
the fundamental question is, how do we trust the data that is input

jamie.steiner
2016-05-24 15:09
if your view is that every trusted event that impacts the state of the ledger must be generated within the ledger, I believe the scope of what can be accomplished is much more limited.

simon
2016-05-24 15:09
and our answer seems to be "if enough peers can validate that the input reflects (approximately) its real value, then it is accepted"

jamie.steiner
2016-05-24 15:09
perhaps time is not the best example - im sure you can implement some method to come to consensus around what the time is.

simon
2016-05-24 15:10
another answer is "if an external trusted entity certifies the fact"

simon
2016-05-24 15:10
i.e. you for time, the FED for interest rates, etc.

jamie.steiner
2016-05-24 15:12
the devil is in the details - for example, what "(approximately)" means for time seems fairly agreeable - it might be harder for LIBOR, but certainly the logic will be different, and the tolerance for different types of data is likely to be hotly contested.

simon
2016-05-24 15:13
correct

simon
2016-05-24 15:13
i suggested creating a framework for consenting on external data long ago

simon
2016-05-24 15:13
but it was not considered the right approach, i guess

simon
2016-05-24 15:16
@jyellick, @vukolic: i'm thinking about how to do periodic null requests, and they are more complicated than you'd think

simon
2016-05-24 15:17
sending a null request is simple, but when do you send it? or rather, when do you expect that the primary sent one?

simon
2016-05-24 15:17
do you look at when you receive a pre-prepare?

simon
2016-05-24 15:17
or when the request commits or executes

simon
2016-05-24 15:18
i guess because of primary, pre-prepare.

ghaskins
2016-05-24 15:19
@simon: I fully agree with you. 1) oracles are not useful here, and 2) I think we _can_ create a framework for defining consensus on external events (and think, in fact, we have to)

ghaskins
2016-05-24 15:20
this is part of what I was driving at with my comments in https://github.com/hyperledger/fabric/pull/1513

ghaskins
2016-05-24 15:21
For instance, to use system-chaincode as the closest thing to approximate what could be part of an external-event framework, I envision these system chaincodes would need to be able to invoke transactions on other chaincode without a tcert context

ghaskins
2016-05-24 15:21
for instance, to emit a time event

simon
2016-05-24 15:23
external data is just a different name for non-determinism

simon
2016-05-24 15:23
and it will be real difficult to do that

jyellick
2016-05-24 15:23
@simon: I would think it should be keyed off the commit. The network is configured to send a null request pre-prepare one second after the last commit if no new requests have been received. Then, for the byzantine check, the backups would have some slightly longer timer, say, 2s for how long to allow between completing an execution, and receiving a commit cert for the next execution.

simon
2016-05-24 15:24
but that's way in the future

simon
2016-05-24 15:24
jyellick: but the primary only controls pre-prepares, not commits

simon
2016-05-24 15:25
but i'm glad you disagree - it's not obvious

jamie.steiner
2016-05-24 15:25
i agree that where this conversation ended is largely very theoretical. I do not believe it will be possible to generate all required data internally

jyellick
2016-05-24 15:26
I'm not convinced that pre-prepare is wrong, but once a pre-prepare has been broadcast (non-byzantinely) we should be guaranteed to get that commit cert.

jyellick
2016-05-24 15:27
I was thinking queue off of commit because its existence is evidence to the fact that the pre-prepare was broadcast in that non-byzantine way

jyellick
2016-05-24 15:29
As I think about it more, handling the timer off of pre-prepare seems safe, I think they should be equivalent, and off of 'pre-prepare' is more fair to the primary, as only its latency to send to the backups is counted against it, not the network latency.

simon
2016-05-24 15:30
yea

simon
2016-05-24 15:30
thanks

simon
2016-05-24 15:30
i'll try to implement that

simon
2016-05-24 15:32
jyellick: oh, it's all not so easy

simon
2016-05-24 15:32
what happens if there are no free sequence numbers?

simon
2016-05-24 15:32
so maybe commit is better

jyellick
2016-05-24 15:33
So I think the timer definitely needs to start after execution. The big thing we cannot do is include the execution time and count it against the 'null timeout'

jyellick
2016-05-24 15:34
@simon: I've started prototyping converting `pbft-core.go` to be more state-machine-y, trying to do it in a PR friendly way in small chunks. Would like to talk about it when you have some time.

simon
2016-05-24 15:34
yea

simon
2016-05-24 15:34
what's your plan?

jyellick
2016-05-24 15:37
So, essentially, there would be an 'event manager' whose simple task is to have an unbuffered channel, through which events get delivered to an event receiver interface whose definition simply requires `processEvent(event interface{}) interface{}`; if `processEvent` returns something that is non-nil, it is treated as a new priority event to be processed next. So for instance `pbft-core.go` becomes an `eventReceiver`, and its `processEvent` looks like:
```
func (instance *pbftCore) processEvent(event interface{}) {
	logger.Debug("Replica %d processing event", instance.id)
	switch ev := event.(type) {
	case viewChangeTimerEvent:
		logger.Info("Replica %d view change timer expired, sending view change", instance.id)
		instance.sendViewChange()
	case messageEvent:
		msg := ev
		logger.Debug("Replica %d received incoming message from %v", instance.id, msg.sender)
		instance.recvMsg(msg.msg, msg.sender)
	case stateUpdatingEvent:
		update := ev
		instance.skipInProgress = true
		instance.lastExec = update.seqNo
		instance.moveWatermarks(instance.lastExec) // The watermark movement handles moving this to a checkpoint boundary
		...
	case execDoneEvent:
		instance.execDoneSync()
	default:
		logger.Error("Replica %d received an unknown message type", instance.id)
	}
}
```

jyellick
2016-05-24 15:38
Then for the plugins, they would also be `eventReceiver`s, and they would first process the event, then pass it into PBFT core if they choose to

simon
2016-05-24 15:39
why the return?

jyellick
2016-05-24 15:39
My concern is overflowing the stack

simon
2016-05-24 15:39
so that there is a formalized way?

simon
2016-05-24 15:39
oh, go doesn't do tail calls?

jyellick
2016-05-24 15:39
Not to my knowledge

simon
2016-05-24 15:40
usually compilers do these days

jyellick
2016-05-24 15:40
(But I could be wrong)

simon
2016-05-24 15:40
i would implement whatever is easier

simon
2016-05-24 15:40
however, there needs to be an interface to enqueue events (timer events)

jyellick
2016-05-24 15:40
But generally, I dislike that at the end of `sendViewChange` we call `processNewView`, it seems like it would be clearer if we just injected that as an event

jyellick
2016-05-24 15:40
Yes, I've also worked a little on that

simon
2016-05-24 15:41
timers need to be first class objects supported by the event manager

jyellick
2016-05-24 15:41
The key is timer events need to be cancel-able, and I think I've got that down

simon
2016-05-24 15:41
well, cancelling events is simple

simon
2016-05-24 15:41
you just set a field `cancelled` to `true`

jyellick
2016-05-24 15:42
It depends on your implementation but isn't necessarily quite that easy, you have the race in that case

simon
2016-05-24 15:42
the event manager will have to maintain its own timer wheel, and pick the next event from that wheel, and post it to `processEvent`

jyellick
2016-05-24 15:42
```
if !cancelled {
	sendEvent()
}
```
What if it's canceled after it enters into the (blocking) `sendEvent()` call

simon
2016-05-24 15:43
who could cancel it?

simon
2016-05-24 15:43
only `processEvent` can cancel

jyellick
2016-05-24 15:44
So, classic case, we're executing a request, and the view change timer expires, then we cancel the view change timer because the execution finished, and we've still got this view change event waiting for us.

simon
2016-05-24 15:44
well no

simon
2016-05-24 15:45
because the event manager will go and wait on `newMessage() || timerExpired()`

simon
2016-05-24 15:45
it will get the timerExpired, but then sees that the timer got cancelled

simon
2016-05-24 15:45
so it discards it

jyellick
2016-05-24 15:46
Why not, simply have the timer not send the event if it is reset before the event is read?

simon
2016-05-24 15:46
or that

simon
2016-05-24 15:46
it's all internal to the event manager

jyellick
2016-05-24 15:47
Right. I liked the idea of pushing the complexity of 'not sending canceled timers' into the timer, rather than into event delivery, but yes, the key is they both happen in the event manager.

simon
2016-05-24 15:48
so how do we ingress events?

simon
2016-05-24 15:48
does the public API just enqueue events to the event manager?

jyellick
2016-05-24 15:49
I say the event manager has a thread, the only thread which touches any state, and the public API will simply write events onto the unbuffered channel that event thread reads from

jyellick
2016-05-24 15:49
(So, generally, the public API calls would block until the event is delivered)

simon
2016-05-24 15:49
yes

simon
2016-05-24 15:51
so the event manager has two sources of events, its internal timer thing, and "incoming" event (channel), which is written to by the public API

jyellick
2016-05-24 15:52
Right. The channel is of type `interface{}` so you can write whatever event type you want to onto it, which makes it nicely pluggable, then on the other side, the type switch figures out what event it is, and gives you whatever event metadata is required

jyellick
2016-05-24 15:53
I did some brief research, and doing switching based on the type like that is (according to stack overflow) 4-5 times slower than switching on an int, but that still seems plenty fast, the switching is not likely to be our performance bottleneck, and it could be re-written as an int type switch later if we really needed to.
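A self-contained sketch of the manager/receiver shape being discussed; `Manager`, `Receiver`, and the toy event types are hypothetical names, not the PR's actual code. The non-nil return from `ProcessEvent` becomes a priority follow-up event processed in a loop, which avoids recursive calls (Go does not guarantee tail-call elimination).

```go
package main

import "fmt"

// Receiver is the event-processing interface; a non-nil return value is a
// priority follow-up event.
type Receiver interface {
	ProcessEvent(event interface{}) interface{}
}

// Manager owns the single goroutine that touches any state; the public API
// writes events of any type onto its unbuffered channel.
type Manager struct {
	events chan interface{}
	recv   Receiver
}

func NewManager(r Receiver) *Manager {
	m := &Manager{events: make(chan interface{}), recv: r}
	go m.loop()
	return m
}

// Queue blocks until the event is delivered to the manager's thread.
func (m *Manager) Queue(event interface{}) { m.events <- event }

func (m *Manager) loop() {
	for ev := range m.events {
		// Follow-up events run before the channel is drained again.
		for next := m.recv.ProcessEvent(ev); next != nil; next = m.recv.ProcessEvent(next) {
		}
	}
}

// A toy receiver: a message event triggers a follow-up committed event.
type messageEvent struct{ payload string }
type committedEvent struct{}

type toyCore struct{ done chan struct{} }

func (c *toyCore) ProcessEvent(event interface{}) interface{} {
	switch ev := event.(type) {
	case messageEvent:
		fmt.Println("received:", ev.payload)
		return committedEvent{} // priority follow-up
	case committedEvent:
		fmt.Println("committed")
		close(c.done)
	}
	return nil
}

func main() {
	core := &toyCore{done: make(chan struct{})}
	m := NewManager(core)
	m.Queue(messageEvent{payload: "pre-prepare"})
	<-core.done
}
```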

simon
2016-05-24 15:53
how does that integrate with timers?

simon
2016-05-24 15:53
i like the way you're switching

simon
2016-05-24 15:54
although an alternative would be to enqueue funcs

simon
2016-05-24 15:54
and just execute them

jyellick
2016-05-24 15:54
Yes, I considered the func queue, but this seemed less invasive to the current code, and not obviously worse

simon
2016-05-24 15:54
`chan <- func(){ op.doFoo(arg) }`

jyellick
2016-05-24 15:54
The timer would simply be another type which would get shoved onto the channel. I figured the manager could provide an interface for registering timer types

simon
2016-05-24 15:55
wait, onto the channel?

simon
2016-05-24 15:55
i don't know whether i like that channel API

simon
2016-05-24 15:56
i think a `eventManager.Queue(...)` and `eventManager.Timer(duration, ...)` would be more explicit and symmetrical

simon
2016-05-24 15:57
for testing, we would implement a different eventmanager, i guess

simon
2016-05-24 15:58
where timers just don't have any duration, but execute when there is nothing else happening

simon
2016-05-24 15:58
of course the timer events would still be ordered properly

simon
2016-05-24 15:59
the more i think about the null requests, the less happy i am about them

simon
2016-05-24 15:59
watermarks only update when checkpoints are reached

simon
2016-05-24 16:00
so that binds in execution

simon
2016-05-24 16:01
i don't think null requests are a clear and simple solution

jyellick
2016-05-24 16:08
The channel would not be exposed to the outside world, only internal to the event manager, but because the timers are contained in the event manager, they are free to interact with it directly (which, I just showed @kostas, makes for some pretty clean code). We could have an API on the event manager so that someone like `RecvMsg` could write `manager.Queue(event interface{})` which then blocks on a channel write, or we could have the API call write to the channel directly. The channel is more flexible, but the API call is maybe more approachable.

ghaskins
2016-05-24 16:10
@simon: you mentioned external data is non deterministic and thats a problem….I would argue, thats the point

jyellick
2016-05-24 16:10
As to null requests, the eventual target would be to actually update watermarks based on sequence numbers from non-checkpoints. I discussed with @kostas this weekend on why the original implementation used checkpoints, but there's nothing preventing it.

ghaskins
2016-05-24 16:11
we want it to go through consensus…

ghaskins
2016-05-24 16:11
if 7 out of 10 nodes agree it is at least May 24 2016 UTC, then it is at least May 24 2016, otherwise it isnt

ghaskins
2016-05-24 16:12
non-deterministic events will never be legitimized…we just need to make sure the system can handle the possibility of their introduction

ghaskins
2016-05-24 16:15
to be clear, the framework would only support the confirmation of events that have some semblance of determinism to them….the passing of a date, the delivery of a package, the current interest rate by the FED, etc…its the job of the framework to associate consensus around that, not to legitimize random stuff being injected
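The "7 out of 10 nodes agree" idea can be made concrete with a small sketch (function name and encoding are illustrative only, not fabric code): each replica reports its clock reading, and taking the (f+1)-th smallest report means f byzantine replicas claiming far-future times sort to the top and are ignored, while far-past liars can at worst pull the result down to the smallest honest reading.

```go
package main

import (
	"fmt"
	"sort"
)

// agreedTime takes clock readings from all replicas, of which up to f may be
// byzantine, and returns the (f+1)-th smallest as the agreed-upon time.
func agreedTime(reports []int64, f int) int64 {
	sorted := append([]int64(nil), reports...) // don't mutate the caller's slice
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	return sorted[f] // (f+1)-th smallest
}

func main() {
	// n=4, f=1: one byzantine replica claims a time far in the future.
	reports := []int64{1464076800, 1464076802, 1464076801, 9999999999}
	fmt.Println(agreedTime(reports, 1)) // 1464076801: an honest reading
}
```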

simon
2016-05-24 16:23
jyellick: so what's the api of the event manager?

simon
2016-05-24 16:24
is it `Queue(interface{}), Timer(duration, interface{}) someTimerObj, Cancel(someTimerObjType)`?

jyellick
2016-05-24 16:26
@simon: Not yet finalized, but I'm not certain there needs to be anything other than `Queue`, you can create the timer with a reference to the event manager, and then use the timer's `start` and `stop` methods.

simon
2016-05-24 16:27
no

jyellick
2016-05-24 16:28
The actual 'event wheel' type stuff could be implemented separately, as you have in custodian

simon
2016-05-24 16:28
for tests the timers must be non-timers

jyellick
2016-05-24 16:28
You could have the manager act as a timer factory if you wanted

simon
2016-05-24 16:28
the event wheel must be in the event manager

jyellick
2016-05-24 16:28
I don't see why

jyellick
2016-05-24 16:29
You have an event wheel in custodian, which seems to work fine, it just needs to not have a go routine

simon
2016-05-24 16:29
because otherwise, how can the event manager decide whether to service a timer or an event?

jyellick
2016-05-24 16:29
Because it only gets the timer event if the event is not canceled

simon
2016-05-24 16:29
so the event manager internally uses something that implements a timer wheel?

jyellick
2016-05-24 16:30
Why do you need a timer wheel? Why not simply atomically get timer events; if you receive a timer event, then it is valid.

simon
2016-05-24 16:30
who produces the timer events?

jyellick
2016-05-24 16:30
The event timer, which has a reference to the event manager

jyellick
2016-05-24 16:31
(so that it can atomically deliver events)

simon
2016-05-24 16:31
and how is that event timer implemented?

jyellick
2016-05-24 16:31
As a small state machine in a select statement

simon
2016-05-24 16:31
and where is it implemented?

jyellick
2016-05-24 16:31
in the `eventTimer` struct

simon
2016-05-24 16:32
i mean the code

simon
2016-05-24 16:32
of that state machine

jyellick
2016-05-24 16:33
The event timer would have a go routine which is created when it is constructed, that go routine would be responsible for servicing the events of 'start' 'stop' 'timer expired' and 'deliver event'.

jyellick
2016-05-24 16:33
The atomicity comes from the fact that the start, stop, and deliver events are all bound to the event manager thread

simon
2016-05-24 16:34
how would that work in tests?

simon
2016-05-24 16:34
i.e. how do we make tests deterministic?

simon
2016-05-24 16:34
use a different timer implementation?

jyellick
2016-05-24 16:35
Yes, I think that would make most sense, have a timer factory of some sort so we can override the timer implementation in our unit tests

simon
2016-05-24 16:35
so then we might put that into the event manager

jyellick
2016-05-24 16:35
We could

jyellick
2016-05-24 16:36
Since we would be supplying a different event manager for unit tests

jyellick
2016-05-24 16:36
It would be a natural place to do it

simon
2016-05-24 16:36
yes

simon
2016-05-24 16:36
i don't see any benefit of using a go routine per timer, vs a timer wheel for all timers

jyellick
2016-05-24 16:38
The event manager in the non-unit test implementation becomes much simpler, and it allows us, in the near term, to retain the majority of our existing code

jyellick
2016-05-24 16:38
I don't see any particular problem with a go routine per timer

simon
2016-05-24 16:39
well, it needs more resources.

jyellick
2016-05-24 16:39
The overhead of a go routine is pretty minimal, especially since we're talking about two or three of them

simon
2016-05-24 16:39
well, custodian

simon
2016-05-24 16:39
potentially hundreds

jyellick
2016-05-24 16:39
No, not at all

simon
2016-05-24 16:40
the advantage of a timer wheel is that event order is determined on enqueue

jyellick
2016-05-24 16:40
For custodian, who already implements a timer wheel, he would simply keep his implementation, and use a single timer to trigger processing

simon
2016-05-24 16:40
and events don't race delivery

jyellick
2016-05-24 16:40
For custodian, that works because all of your timeouts are the same

jyellick
2016-05-24 16:40
Or rather, your queuing logic is simple because of that

simon
2016-05-24 16:40
yes

simon
2016-05-24 16:41
basically i'm proposing a completely deterministic system

simon
2016-05-24 16:41
with go routines, you get events delivered non-deterministically

jyellick
2016-05-24 16:42
You're proposing a completely deterministic set of timers, we still get non-determinism on timer vs. msg for instance

simon
2016-05-24 16:42
yes we do

simon
2016-05-24 16:42
but that we can't control

simon
2016-05-24 16:43
i think implementing the timers with goroutines is fine for now, because it reduces the amount of change

simon
2016-05-24 16:43
but eventually, i'd like to replace the timers with a deterministic timer wheel

jyellick
2016-05-24 16:44
I think there's nothing prohibiting that, and in fact, I think we'd basically just move the select statement out of the timer and into the event manager

jyellick
2016-05-24 16:44
(with a few minor modifications of course)

simon
2016-05-24 16:44
yea

simon
2016-05-24 16:44
okay, i gotta take off

jyellick
2016-05-24 16:44
Alright, thanks for the discussion

simon
2016-05-24 16:45
if you push intermediate code to your repo, i'll be able to review it at some point

jyellick
2016-05-24 16:46
I'm really trying to keep these changesets small so that we can meaningfully do them as PRs

simon
2016-05-24 16:46
yea

jyellick
2016-05-24 16:47
Hopefully I can push it as a PR for public review/comment (so that we don't have to do the private repo review)

muralisr
2016-05-24 16:47
@simon, @jyellick : haven’t paid attention… saw go routines and timers.. will this help stabilize tests so timing is not an issue ?

simon
2016-05-24 16:47
yes

muralisr
2016-05-24 16:47
ok

muralisr
2016-05-24 16:47
thanks

simon
2016-05-24 16:47
once tests don't have goroutines, all will be better

jyellick
2016-05-24 16:47
@muralisr: Yes, the biggest source of instability in our tests is racing go routines, we're trying to kill them off

muralisr
2016-05-24 16:48
ok

jyellick
2016-05-25 15:04
@simon: If you could take a look at the PR for the event stuff we discussed, I'd appreciate it https://github.com/hyperledger/fabric/pull/1586

simon
2016-05-27 09:48
So it is not clear how to trigger a regular view change, because there are multiple outstanding requests at every time. Maybe together with checkpoints?

tuand
2016-05-27 13:01
so #756, PR #1623 ... how should we proceed ? at first glance, do we move all that logic inside the plugin ? get a peerconnected/disconnected event and keep track inside consensus ?

charles-cai
2016-05-27 13:30
has joined #fabric-consensus-dev

simon
2016-05-27 13:43
yes, i think that would be best

simon
2016-05-27 13:44
like the RecvMsg() interface, just with PeerConnected() and PeerDisconnected()

simon
2016-05-27 13:44
or, alternatively, PeerEvent(connected/disconnected)

simon
2016-05-27 13:44
do you remember why we need to wait for enough peers to connect before we start consensus?

jyellick
2016-05-27 13:45
So that we can establish a total ordering of the replicas

simon
2016-05-27 13:45
ah yes

kostas
2016-05-27 13:45
you don't know their IDs in advance...

simon
2016-05-27 13:45
so that is only required once

simon
2016-05-27 13:45
that's good

kostas
2016-05-27 13:45
correct

tuand
2016-05-27 13:45
yes, we need to establish all the replicaIDs before doing anything

simon
2016-05-27 13:45
eventually we can also move `Broadcast` into the consensus

simon
2016-05-27 13:46
but small steps at a time

tuand
2016-05-27 13:47
so i noticed that we still broadcast to all peers ... i'll change to broadcast to all validating peers

tuand
2016-05-27 13:48
i'll try to refactor on top of @jyellick event manager ? jason, do you have a branch i can work off of ?

jyellick
2016-05-27 13:48
pbft-state-machine-pr2 includes the latest changes

tuand
2016-05-27 13:49
ok

jyellick
2016-05-27 13:51
@simon @kostas @tuand : I'd like to get 1557, 1586, 1595, 1596, 1614, and 1622 merged today if possible, if you could review and signoff

kletkeman
2016-05-27 14:00
has joined #fabric-consensus-dev

simon
2016-05-27 14:02
tuand: well, or we just start unicasting from consensus directly

simon
2016-05-27 14:03
jyellick: yea, let's go with it. i'd prefer to change that timerfactory stuff, but i think it can be done later

jyellick
2016-05-27 14:06
@simon: Yes, there will definitely be some tweaks down the line

kostas
2016-05-27 14:06
@simon: what's the logic behind moving broadcasting to consensus?

simon
2016-05-27 14:08
kostas: that broadcast doesn't block when sending to byzantine replicas

simon
2016-05-27 14:08
and consensus doesn't send messages to non-whitelisted replicas

kostas
2016-05-27 14:10
got it, good call

simon
2016-05-27 14:14
does anybody have a suggestion how to do periodic view changes?

simon
2016-05-27 14:14
i guess the primary would stop accepting new requests

simon
2016-05-27 14:15
the backups send a view-change when they have a commit certificate for the last request of the primary

simon
2016-05-27 14:16
and new pre-prepares that should not have been sent by the primary directly lead to a view change

simon
2016-05-27 14:17
but rotating the primary means that requests may be lost initially, if the request was sent to the primary that now becomes a backup

simon
2016-05-27 14:17
so that means we probably should be broadcasting requests again...

simon
2016-05-27 14:17
it is all a bit hacky

jyellick
2016-05-27 14:28
@simon: My perhaps naive vision for periodic view change was to have a config variable set, which would be "change view after n checkpoints", so, once the good primary hits that number, it sends a VIEW-CHANGE (that way we never change view with a non-empty pset, and the xset just has the single null request)
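The change-view-after-n-checkpoints policy reduces to a counter on the primary; a tiny sketch with hypothetical field names (not fabric's actual config):

```go
package main

import "fmt"

// primary tracks stable checkpoints seen in the current view; once the
// configured threshold is reached it voluntarily abdicates, so the view
// change always lands on a checkpoint boundary with an empty pset.
type primary struct {
	viewChangePeriod uint64 // checkpoints per view; 0 disables rotation
	checkpointsSeen  uint64
}

// onStableCheckpoint returns true when the primary should send a VIEW-CHANGE.
func (p *primary) onStableCheckpoint() bool {
	if p.viewChangePeriod == 0 {
		return false
	}
	p.checkpointsSeen++
	if p.checkpointsSeen >= p.viewChangePeriod {
		p.checkpointsSeen = 0
		return true
	}
	return false
}

func main() {
	p := &primary{viewChangePeriod: 3}
	for i := 1; i <= 7; i++ {
		if p.onStableCheckpoint() {
			fmt.Printf("checkpoint %d: rotating primary\n", i)
		}
	}
}
```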

jyellick
2016-05-27 14:29
The backup who becomes the new primary would send a VIEW-CHANGE and a NEW-VIEW

jyellick
2016-05-27 14:30
Shouldn't we already be resubmitting requests after a view change?

simon
2016-05-27 15:16
batch does, yes

vukolic
2016-05-27 15:28
As discussed on the last HL Arch WG on Wednesday - the proposal for the next consensus architecture is posted here: https://github.com/hyperledger/fabric/wiki/Next-Consensus-Architecture-Proposal

vukolic
2016-05-27 15:28
please review, comment, discuss and contribute

simon
2016-05-27 15:54
great!

cbf
2016-05-27 16:19
thanks for getting this out @vukolic will be good to get some others to weigh in on the paper

tuand
2016-05-27 16:46
@vukolic, can you also announce on the hyperledger mailing lists ?

vukolic
2016-05-27 16:47
i am just doing that

vukolic
2016-05-27 16:47
I will send to hl-fabric

vukolic
2016-05-27 16:47
not sure if I should to others?

tuand
2016-05-27 16:49
there's hyperledger-architecture-wg

tuand
2016-05-27 16:50
i thought there was a fabric-announce listserv but can't find it now

simon
2016-05-27 18:10
jyellick: how do i wait for the pbft core to finish (for testing?)

jyellick
2016-05-27 18:11
I just commented on your PR

jyellick
2016-05-27 18:12
I've been converting tests which were not network based, but only a single instance of the PBFT core, to simply not start up the event manager, and instead, have the test manually inject events in via `sendEvent(pbftCore, event)`

jyellick
2016-05-27 18:12
With the latest changes, you can inject any message, like for instance `sendEvent(pbftCore, &PrePrepare{...})`

simon
2016-05-27 18:12
ah!

simon
2016-05-27 18:13
well eventually we should have a test event manager that doesn't use timers/goroutines

jyellick
2016-05-27 18:13
Will need to come up with some better solution for the networked ones in the future

jyellick
2016-05-27 18:13
Yes, exactly, eventually we'll want the test event manager to run entirely on the test thread

jyellick
2016-05-27 18:13
No more non-determinism in our network tests

simon
2016-05-27 18:14
yes

jyellick
2016-05-27 18:14
(The network based ones today, you can inject messages with `pbftCore.manager.queue() <- &Message{}`, which is ugly and I hate, and should go away once we fix up the mock network)

simon
2016-05-27 18:15
right

jphillips
2016-05-28 06:01
has joined #fabric-consensus-dev

mandler
2016-05-30 11:31
architecture

simon
2016-05-30 12:13
so the problem with periodic view change is that after a crash fault, I don't know what the next change should be

simon
2016-05-30 12:16
i guess a crashed replica could just generally send a view change when it comes up...

ghaskins
2016-05-30 12:26
@simon: if you remember that work I did for @kostas last November, I had a state called “convening” that was used precisely in that case

ghaskins
2016-05-30 12:26
it was a state that was entered either at start up or if the network lost quorum….it had special rules for view changes that accounted for the fact that the network may be way ahead

ghaskins
2016-05-30 12:26
could probably do something like that here

yingfeng
2016-05-31 02:54
has joined #fabric-consensus-dev


simon
2016-05-31 12:56
would appreciate review. for now I implemented a dumb version of just cycling after a fixed number of requests from a primary, but we could "optimize" it by cycling after stable checkpoints. The advantage of cycling after stable checkpoints is that the new primary does not have to repeat any request; the disadvantage is that it may stop the network for considerable time (wait for execution to finish at 2f+1 replicas).

simon
2016-06-01 13:56
jyellick: you around?

jyellick
2016-06-01 13:56
I am

simon
2016-06-01 13:57
i'm looking at a mock event manager

simon
2016-06-01 13:57
the issue seems to be that queue() returns a chan, which implies that there needs to be a goroutine somewhere to service that chan

jyellick
2016-06-01 13:57
So, the real event manager does have a goroutine

simon
2016-06-01 13:58
yes i know

jyellick
2016-06-01 13:58
For the mock, I think that thread should be the test thread

simon
2016-06-01 13:58
the problem is, there needs to be one goroutine per manager

jyellick
2016-06-01 13:58
Why?

simon
2016-06-01 13:58
because there need to be as many queues as managers

simon
2016-06-01 13:59
and i can't select on an array of queues

jyellick
2016-06-01 13:59
Ah, it's possible via reflection, though a bit slower

jyellick
2016-06-01 13:59
(which might be fine for tests, though I'd keep it out of the production path)

simon
2016-06-01 14:00
how?
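
The reflection-based select jyellick mentions looks roughly like this (a sketch; fine for tests, but as noted it is slower than a static `select` and should stay out of the production path):

```go
package main

import (
	"fmt"
	"reflect"
)

// selectAny blocks until any of the given channels can receive, returning
// the index of the ready channel and the received value.
func selectAny(queues []chan interface{}) (int, interface{}) {
	cases := make([]reflect.SelectCase, len(queues))
	for i, q := range queues {
		cases[i] = reflect.SelectCase{
			Dir:  reflect.SelectRecv,
			Chan: reflect.ValueOf(q),
		}
	}
	chosen, value, _ := reflect.Select(cases)
	return chosen, value.Interface()
}

func main() {
	queues := make([]chan interface{}, 3)
	for i := range queues {
		queues[i] = make(chan interface{}, 1)
	}
	queues[1] <- "hello"
	i, v := selectAny(queues)
	fmt.Println(i, v)
}
```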


simon
2016-06-01 14:01
oh god

simon
2016-06-01 14:01
if queue() would work like inject(), then this wouldn't be necessary

jyellick
2016-06-01 14:03
If this is a good reason to make `queue` accept a parameter rather than return a channel, then I'm fine with changing it.

jyellick
2016-06-01 14:03
The channel initially seemed more flexible, but in this case, I can see where the function call is

simon
2016-06-01 14:03
okay

simon
2016-06-01 14:04
i'll change it to see how the code looks

jyellick
2016-06-01 14:04
I've also got some pending changes which pulls the manager out of `pbft-core.go` which I'll post once I work the bugs out of it

jyellick
2016-06-01 14:05
(basically just moves it out a layer, so the consumer, or pbftendpoint are the ones holding the reference)

simon
2016-06-01 14:07
but the timer factory is still passed in?

jyellick
2016-06-01 14:07
Right, the timerfactory is now a parameter

simon
2016-06-01 14:07
so... timer events don't work without the queue

simon
2016-06-01 14:07
hmm

jyellick
2016-06-01 14:07
What do you mean?

simon
2016-06-01 14:07
but the timer events know the implementation

jyellick
2016-06-01 14:08
They do

simon
2016-06-01 14:08
and can use the events chan directly

jyellick
2016-06-01 14:08
Right

simon
2016-06-01 14:26
hmmm

simon
2016-06-01 14:26
but then i can't mock it for the tests

jyellick
2016-06-01 14:40
Why not?

simon
2016-06-01 14:48
if queue() does not return a chan, then the timer implementation needs to reach inside the manager to access the queue

simon
2016-06-01 14:48
but to reach inside the manager, it cannot use an interface, but needs to use the impl type

simon
2016-06-01 14:49
which means that for testing the timer, we cannot use a mock manager.

simon
2016-06-01 14:54
so, hm.

jyellick
2016-06-01 14:57
Right, hmm

jyellick
2016-06-01 14:57
Well, it might be worth going ahead and trying to turn the non-mock manager into a timer wheel

jyellick
2016-06-01 14:58
I think essentially, it will still need one of the current style event timers, which is simply set to fire at the nearest upcoming target event time

jyellick
2016-06-01 14:58
But might simplify the mocking

simon
2016-06-01 15:14
yea

simon
2016-06-01 15:15
i just would like to keep the changes small

simon
2016-06-01 15:15
i guess i could introduce an internal interface


simon
2016-06-01 15:48
hmm

simon
2016-06-01 15:49
setReceiver seems odd

simon
2016-06-01 15:50
but yea

jyellick
2016-06-01 17:09
There's the weird bidirectional dependency between the event manager and event receiver. The receiver needs some way to create timers the event manager understands, which means the timer factory needs a reference to the manager, which needed a reference to the receiver. So, the `setReceiver` is a little odd, but was the cleanest way I could come up with breaking that cycle.
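
A sketch of the construction ordering `setReceiver` enables, breaking the manager/receiver cycle described here (names approximate the discussion, not the actual fabric code):

```go
package main

import "fmt"

type event interface{}

type receiver interface {
	processEvent(e event) event
}

type manager struct {
	rcv   receiver
	queue chan event
}

func newManager() *manager { return &manager{queue: make(chan event, 16)} }

// setReceiver wires the receiver in after construction, so the manager
// can be handed to the timer factory before the receiver exists.
func (m *manager) setReceiver(r receiver) { m.rcv = r }

// timerFactory holds a reference to the manager so the timers it creates
// can deliver firings onto the manager's queue.
type timerFactory struct{ m *manager }

type core struct{ seen int }

func (c *core) processEvent(e event) event { c.seen++; return nil }

func main() {
	m := newManager()         // 1. manager first, receiver still unknown
	tf := &timerFactory{m: m} // 2. timers can already reference the queue
	_ = tf
	c := &core{}      // 3. receiver built (would take tf as a parameter)
	m.setReceiver(c)  // 4. cycle closed last
	m.rcv.processEvent(nil)
	fmt.Println(c.seen)
}
```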

sheehan
2016-06-01 18:30
Hi @jyellick @simon - are there any consensus PRs ready for review?

jyellick
2016-06-01 18:32
I think 1605 is good to go, I can add a comment to that effect

sheehan
2016-06-01 18:33
thanks

jyellick
2016-06-01 18:33
I think 1623 could be closed, while @tuand reworks it, maybe he can chime in yay or nay

sheehan
2016-06-01 18:34
np to leave it open. whatever works best for you

tuand
2016-06-01 18:34
i'll close it ... waiting for kostas to update 756

kostas
2016-06-01 18:35
it can be closed regardless of 756 updates

jyellick
2016-06-01 18:36
I'm not sure what @simon's response to 1675 in channel meant, I think it was tacit approval, but we can wait for more review from him, or maybe @kostas or @tuand can review now, was going to take a look at 1663 now that it's been rebased

tuand
2016-06-01 18:37
1623 closed

tuand
2016-06-01 18:42
taking a look at 1675 now ... can't find simon's slack comment but probably better if documented in PR

chenhua
2016-06-02 04:13
has joined #fabric-consensus-dev

jyellick
2016-06-02 14:07
@simon: Trying to trace this view change code, seems like it can't be right, feel like I'm going crazy, do you have a minute?

simon
2016-06-02 14:07
sure

jyellick
2016-06-02 14:08
If you could take a look at `recvViewChange`

jyellick
2016-06-02 14:09
Down to the `if len(replicas) >= instance.f+1` piece, basically, if we've got a weak cert of view change messages, we need to send a view change, seems right so far to me

jyellick
2016-06-02 14:09
So we call `instance.sendViewChange`

jyellick
2016-06-02 14:10
Which, without really any conditionals, builds a view change message, and then invokes `recvViewChange`

jyellick
2016-06-02 14:10
Which obviously will then satisfy the condition `len(replicas) >= instance.f+1` because it was just matched and we added another

jyellick
2016-06-02 14:11
Which will then call `sendViewChange` again, which will then send it again, which will then call `recvViewChange` again, which will detect that we already have this view change message, and return nil.

jyellick
2016-06-02 14:12
And nothing in this path will trigger processing the new view.

simon
2016-06-02 14:12
but that's not what we see happening?

jyellick
2016-06-02 14:13
It might be, I guess nothing in that should break anything, but we should always be double broadcasting view change messages, which I hadn't noticed.

jyellick
2016-06-02 14:13
(Or rather, not always, but often)

jyellick
2016-06-02 14:15
Mostly just wanted a sanity check on my reasoning, make sure I'm not missing something obvious (have been staring at this particular code path too long)

simon
2016-06-02 14:15
maybe we need a test to see whether this really happens

simon
2016-06-02 14:16
i only see 2 scenarios where this would appear as a bug:

simon
2016-06-02 14:16
1. primary is slow with view change and receives f+1 messages, but then fails to send new-view

jyellick
2016-06-02 14:16
If our view-change was the 2f+1th, it would delay sending/processing a new view. But, for f>0 I don't think that's actually possible

simon
2016-06-02 14:17
2. we already received new-view and we are slow with view-change, and now received the f+1 view-change.

jyellick
2016-06-02 14:17
Hmmm, yes, I will write a test for this, should include one with the fix

jyellick
2016-06-02 14:19
Actually, yes, I'm worried that once we have the f+1 view change messages, it will never trigger the new view processing via this path, because it will always do the `sendViewChange` loop which returns before hitting the new view path.

jyellick
2016-06-02 14:19
Alright, thanks for the chat, will go write that test.

simon
2016-06-02 14:22
so then we shouldn't return

simon
2016-06-02 14:24
HA!

simon
2016-06-02 14:24
no

simon
2016-06-02 14:24
jyellick: that loop only counts view-change messages that are *above* our own view

simon
2016-06-02 14:25
so the second time we get in there, we do not enter the f+1 branch
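
Simon's point condensed into a sketch: the weak-certificate branch only counts view-change messages for views strictly *above* the replica's own, so after `sendViewChange` bumps our view, re-entering `recvViewChange` does not fire the branch again. Names here are illustrative, not the actual fabric code:

```go
package main

import "fmt"

// viewChange mimics a PBFT view-change message (illustrative only).
type viewChange struct {
	ReplicaID uint64
	View      uint64
}

// needsViewChange reports whether f+1 distinct replicas claim a view
// strictly greater than ownView (the weak certificate).
func needsViewChange(ownView uint64, f int, msgs []viewChange) bool {
	byReplica := map[uint64]bool{}
	for _, vc := range msgs {
		if vc.View > ownView { // only views above our own count
			byReplica[vc.ReplicaID] = true
		}
	}
	return len(byReplica) >= f+1
}

func main() {
	msgs := []viewChange{{1, 1}, {2, 1}, {3, 1}}
	// In view 0 with f=1, three messages for view 1 form a weak cert:
	fmt.Println(needsViewChange(0, 1, msgs))
	// After we send our own view change, our view is already 1,
	// so the same messages no longer trigger the branch:
	fmt.Println(needsViewChange(1, 1, msgs))
}
```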

jyellick
2016-06-02 14:25
Aha! Thanks!

simon
2016-06-02 14:28
hehe

simon
2016-06-02 14:28
i remember thinking that before

simon
2016-06-02 14:28
the comment is clear as well

simon
2016-06-02 14:28
but maybe we should add another comment

simon
2016-06-02 15:16
our coverage doesn't look bad at all

tuand
2016-06-02 15:18
@tuand uploaded a file: https://hyperledgerproject.slack.com/files/tuand/F1DL2KWTY/aa.out and commented: this is what i got for test coverage . do `go tool cover -html=aa.out` on this file

jyellick
2016-06-02 15:28
There are a few pieces that looks like they could improve, but generally, yeah, not too bad.

ghaskins
2016-06-02 15:58
@simon @jyellick seeing lots of failures in the consensus/obcpbft unit tests in presumably unrelated branches of the code, are you aware of this?

jyellick
2016-06-02 15:58
What failures?

ghaskins
2016-06-02 15:58
I just restarted one in travis, I suspect I lost the log as a result

ghaskins
2016-06-02 15:58
i have a local run of something that looked identical (though different code change), ill see if I can capture it to a log

jyellick
2016-06-02 15:59
Unit tests have run reliably locally for me and in the CI I've pushed, so would appreciate a pointer to the failure

ghaskins
2016-06-02 16:00

ghaskins
2016-06-02 16:00
but note I have seen it in at least three unrelated branches now

ghaskins
2016-06-02 16:00
but same basic failure (at least when looking from 50k feet up)

jyellick
2016-06-02 16:01
Ah, yeah, that's a new test for a feature that was merged a couple days ago

ghaskins
2016-06-02 16:01
gut feeling, it might be flaky

jyellick
2016-06-02 16:01
Yeah, I'll take a look, thanks @ghaskins

ghaskins
2016-06-02 16:01
but more research needed

ghaskins
2016-06-02 16:02
ok, thanks!

simon
2016-06-02 16:15
is the timeout too short?

jyellick
2016-06-02 16:33
That's my immediate guess

tuand
2016-06-02 18:34
i looked at the test coverage results for consensus and statetransfer ... looks like the one that could use more unit tests is `obc-batch (67%)` ?

tuand
2016-06-02 18:36
others like `obc-classic` is really the broadcast() and unicast() calls ... and `helper`, `controller`, etc ... are tested by the obcpbft and behave tests

jyellick
2016-06-02 19:24
@tuand: @kostas @simon Review of https://github.com/hyperledger/fabric/pull/1689 would be appreciated

jyellick
2016-06-02 19:28
@ghaskins: I've included a bump in the timeout for that failing test in the PR above, I can break it out separately if you'd prefer

ghaskins
2016-06-02 19:28
@jyellick: yes, please…I'd like that to go in ASAP, larger series might take longer to review

jyellick
2016-06-02 19:29
Understood, will do

ghaskins
2016-06-02 19:29
ty

ghaskins
2016-06-02 19:29
(and thanks for quick fix)

jyellick
2016-06-02 19:33
@ghaskins: https://github.com/hyperledger/fabric/pull/1690 it's a pretty trivial change, sorry I did not push it earlier, was trying to finish up that other PR

ghaskins
2016-06-02 19:34
i totally understand, no apology necessary

ghaskins
2016-06-02 20:38
@jyellick: still seems to fail CI, can you have a look?

jyellick
2016-06-02 20:38
Was just looking at that, different failure than the last

jyellick
2016-06-02 20:39
In `TestSieveNoDecision` this time

ghaskins
2016-06-02 20:39
ah, ok, I didn’t notice the subtlety there, sorry

jyellick
2016-06-02 20:42
No problem, wonder if maybe Travis is generally under higher load lately or something, nothing in that code path should really be new

jyellick
2016-06-02 21:11
@ghaskins: It's not entirely obvious what went wrong from the log, but, under the theory that it's just decreased Travis performance, I've tuned the timeouts in that test up by 50% as well, and enhanced the logging statement that looked suspicious to be a little more expressive. Pushed and will see how it goes.

ghaskins
2016-06-02 21:30
Cool, thanks

yingfeng
2016-06-03 09:37
Hi, has this issue proved some problems of the consensus protocol? https://github.com/hyperledger/fabric/issues/1701

simon
2016-06-03 11:44
no

simon
2016-06-03 11:44
i don't think so

simon
2016-06-03 11:45
or at least, not necessarily

simon
2016-06-03 11:47
yingfeng: it would be good if you could look at the blocks stored at the peers

vukolic
2016-06-03 12:52
@simon @tuand @jyellick Guys I need consensus w. here


simon
2016-06-03 12:53
hi

vukolic
2016-06-03 12:53
what can or must we fix in the next two weeks

vukolic
2016-06-03 12:53
except #1338

simon
2016-06-03 12:53
i think we went through this already

simon
2016-06-03 12:53
i'm on 1478

vukolic
2016-06-03 12:53
yes - but I am struggling to have conclusions

vukolic
2016-06-03 12:53
e.g., #1340 is nice to have - not a must

tuand
2016-06-03 12:53
i replied to sharon on these ?

vukolic
2016-06-03 12:53
so probably not

vukolic
2016-06-03 12:54
I know I know - but she asks again :slightly_smiling_face:

tuand
2016-06-03 12:54
:slightly_smiling_face:

simon
2016-06-03 12:54
does she want a different answer?

vukolic
2016-06-03 12:54
can we quickly do a rep here on what we commit for 2 weeks from now

vukolic
2016-06-03 12:54
I think she wants a concrete answer :slightly_smiling_face:

simon
2016-06-03 12:54
1338 is simple

vukolic
2016-06-03 12:54
yes

vukolic
2016-06-03 12:54
so that is in

simon
2016-06-03 12:55
1608 has to happen in ledger, so that's nothing we can fix ourselves

vukolic
2016-06-03 12:55
ok

vukolic
2016-06-03 12:55
agree

simon
2016-06-03 12:55
1478 is being worked on and is sort of important

vukolic
2016-06-03 12:55
ok noted

simon
2016-06-03 12:55
1340 not happening

vukolic
2016-06-03 12:55
ack

tuand
2016-06-03 12:55
1180 should be closed

vukolic
2016-06-03 12:55
ack

simon
2016-06-03 12:56
1098 is related to 1478

simon
2016-06-03 12:56
but is more than that

simon
2016-06-03 12:56
also it shouldn't be labeled consensus

vukolic
2016-06-03 12:56
1098 (stack wide)?

tuand
2016-06-03 12:56
1180 simon ?

simon
2016-06-03 12:56
yes 1098

vukolic
2016-06-03 12:57
ok then

vukolic
2016-06-03 12:57
anything else?

tuand
2016-06-03 12:57
agree 1098 is not just consensus

vukolic
2016-06-03 12:57
for other things we have PRs submitted right (e.g., periodic leader rotation)

vukolic
2016-06-03 12:57
?

simon
2016-06-03 12:57
yes

vukolic
2016-06-03 12:57
ok

vukolic
2016-06-03 12:57
thanks

vukolic
2016-06-03 12:57
sorry for the disturbance

tuand
2016-06-03 12:58
#1701 which just came in

vukolic
2016-06-03 12:58
aha - let me look

simon
2016-06-03 12:58
there is 1701, which is related to 1098

vukolic
2016-06-03 12:59
hm 1701 looks bad

vukolic
2016-06-03 12:59
I will put that one in

vukolic
2016-06-03 12:59
so - to recap

tuand
2016-06-03 12:59
i'll try to reproduce in the hackathon today

vukolic
2016-06-03 13:00
#1701, #1478, #1338 and #1098 (to a consensus extent)

vukolic
2016-06-03 13:00
+ #1180 should be closed

vukolic
2016-06-03 13:00
is that ok?

tuand
2016-06-03 13:02
i'd rather we do not have 1098 on the list ? should really be a system design

vukolic
2016-06-03 13:02
ok, let me get rid of it

vukolic
2016-06-03 13:02
else is fine?

tuand
2016-06-03 13:02
ok

vukolic
2016-06-03 13:02
ok thks

vukolic
2016-06-03 13:07
in 1701 it would be ok if he sees one different reply

vukolic
2016-06-03 13:07
but not all 4 :slightly_smiling_face:

vukolic
2016-06-03 13:07
as he actually waits for systems to "settle down", so not really issuing reads concurrently with a write

simon
2016-06-03 13:13
yea no idea what is going on there

simon
2016-06-06 10:17
I've been pondering for hours about how to do this 2f+1 send limit

simon
2016-06-06 10:17
this is real tricky

simon
2016-06-06 10:24
the problem i'm having is that i think i shouldn't treat error results (= no send) as success

simon
2016-06-06 10:28
Let's say we are in a situation where f replicas are disconnected and byzantine, and the rest is connected, but a bit slow. If I don't consider send errors as byzantine, I will end up waiting only for f+1 replicas, instead of for 2f+1, because the other f erroring ones complete most quickly.

simon
2016-06-06 10:30
Now let's say I consider errored ones as byzantine. Maybe they are not, and we just lost the network connection, and instead there are f byz replicas that don't accept data. Then I would be blocked because while waiting for 2f+1 sends to succeed, f byz replicas are not accepting data and the sends never finish.

simon
2016-06-06 10:35
So that doesn't work. It means send errors need to be treated like "it's still sending", i.e. we need to retry sending. Now the question becomes, how long should we wait for the network to become available? If we say "indefinitely", then we stop operation on network failures. Which may be fine, but seems counter to the spirit of PBFT. If we say "don't wait", we will start discarding messages, maybe without need. Imagine a deploy transaction - huge amount of data. Following messages won't fit into the buffers and will be discarded. That won't be good for system performance.

simon
2016-06-06 10:35
That leaves me with some arbitrary send timeout, which feels uncomfortable, but seems necessary.

kostas
2016-06-06 10:37
define "send errors"?

simon
2016-06-06 10:43
anything where send returns an error

simon
2016-06-06 10:44
typically disconnected peers, i'd guess


ghaskins
2016-06-06 11:09
I went through a similar dilemma trying to account for things beyond f problems. Then I realized: you can't do much; it's essentially undefined at that point.

ghaskins
2016-06-06 11:09
So, if you have f bad nodes and a few more slow nodes, just do whatever is the most conservative

ghaskins
2016-06-06 11:10
In this case, that might be to block for a really long time

ghaskins
2016-06-06 11:11
The algorithms are only designed to guarantee loveliness if you have not crossed the Byzantine threshold

ghaskins
2016-06-06 11:12
Liveliness

ghaskins
2016-06-06 11:12
(Thanks, autocorrect)

ghaskins
2016-06-06 11:14
Problems arise when you try to handle the condition in an effort to make forward progress

kostas
2016-06-06 11:32
the paper states that it "guarantees liveness provided message delays are bounded eventually"

kostas
2016-06-06 11:32
so a timeout seems in order here

ghaskins
2016-06-06 11:57
maybe, but that isnt necessarily the conclusion to draw

ghaskins
2016-06-06 11:58
it just simply says that liveness is predicated on messages within the 2f+1 arriving eventually…that has no bearing on what to do if they don't

kostas
2016-06-06 11:59
that is correct - I am looking for a practical fix

ghaskins
2016-06-06 12:00
The system only works at approximately the speed of the healthiest 2f+1 nodes…the moment you don’t have 2f+1 healthy nodes, all bets are off…but the most conservative thing to do might be to wait until you do

ghaskins
2016-06-06 12:00
i guess the question is: what do we hope to accomplish with the timeout?

ghaskins
2016-06-06 12:01
if merely logging, that is harmless enough

ghaskins
2016-06-06 12:01
anything beyond that starts to get into sketchy territory, where we are no longer covered by the proofs

kostas
2016-06-06 12:03
the timeout would fix #1478 and #1056

ghaskins
2016-06-06 12:03
looks

kostas
2016-06-06 12:03
other than that, as I said, I agree - did Marko have a look into this?

simon
2016-06-06 12:04
i didn't yet talk to him about the specific thoughts i came up with

ghaskins
2016-06-06 12:04
i see, seems like we are talking about a different problem here

simon
2016-06-06 12:05
technically you can just discard messages

ghaskins
2016-06-06 12:05
in 1478, IIUC, its 4 out of 5 nodes that are healthy

simon
2016-06-06 12:05
but if you're too aggressive about that, then you basically produce a constantly failing network

kostas
2016-06-06 12:05
@ghaskins: correct, you don't cross the `f` threshold

ghaskins
2016-06-06 12:05
@kostas: ok, apologies, I thought @corecode mentioned situations beyond f failures

kostas
2016-06-06 12:06
he did :simple_smile:

simon
2016-06-06 12:06
f byzantine plus network failures

simon
2016-06-06 12:06
which are not considered byzantine

kostas
2016-06-06 12:06
he also included the link to #1056 which is what I was addressing

ghaskins
2016-06-06 12:07
@simon: I guess it depends on your definition of byzantine, but isn’t it true that you cant easily tell the difference?

simon
2016-06-06 12:07
you cannot

simon
2016-06-06 12:07
and that's exactly the problem

ghaskins
2016-06-06 12:07
right, thats my point…once you cross f, regardless of the reason, all bets are off in terms of how you may reasonably respond

ghaskins
2016-06-06 12:08
it seems the most conservative approach is to assume that forward progress is impeded and actively attempt to restore connectivity until we are below the threshold again

simon
2016-06-06 12:08
no, network failure does not count towards f

ghaskins
2016-06-06 12:09
that makes no sense to me

ghaskins
2016-06-06 12:09
how can you discern “network failure” from any other class of error where the node isnt responsive

kostas
2016-06-06 12:09
FWIW, it doesn't make sense to me either

simon
2016-06-06 12:09
you can't

simon
2016-06-06 12:09
but it still doesn't mean that more than f nodes are byzantine

ghaskins
2016-06-06 12:10
I dont care if I cant reach node X because the trans-atlantic hop went down, the DC it lives in lost power, or the software that runs on it was hacked

simon
2016-06-06 12:10
yes, you don't care

ghaskins
2016-06-06 12:10
either way, its not a healthy node as part of the 2f+1 from the perspective of the observer

simon
2016-06-06 12:10
no

simon
2016-06-06 12:10
an impartial observer can tell that there is a network partition

simon
2016-06-06 12:11
and all the proofs are based on that

ghaskins
2016-06-06 12:11
im not sure that matters though

ghaskins
2016-06-06 12:11
each node is its own observer and nodes that cannot be reached are effectively byzantine

simon
2016-06-06 12:11
no they are not

simon
2016-06-06 12:12
because if they were byzantine and i knew it

simon
2016-06-06 12:12
i would have to stop

ghaskins
2016-06-06 12:12
that sounds like a semantic debate

ghaskins
2016-06-06 12:13
we agree the network is decentralized, right? there is no impartial byzantine status manager, right?

simon
2016-06-06 12:13
on the node, i can't tell what is what, that's correct

ghaskins
2016-06-06 12:13
correct, you cant

simon
2016-06-06 12:13
but this is not just theory

ghaskins
2016-06-06 12:13
all you can tell is how many nodes are responding and in agreement

simon
2016-06-06 12:13
we're trying to build a system that meets expectations that are stronger than what PBFT can provide

ghaskins
2016-06-06 12:14
ok, but lets finish this thought:

ghaskins
2016-06-06 12:14
all we can tell is how many nodes are responding and in agreement, nothing more….and we can try to be helpful to other nodes by responding ourselves

simon
2016-06-06 12:14
right

ghaskins
2016-06-06 12:15
from that, the system, from the perspective of this node, can make forward progress if 2f+1 are responding and in agreement

ghaskins
2016-06-06 12:15
and we cannot make forward progress if we have less than that

ghaskins
2016-06-06 12:16
we will quickly run into trouble if we try to get fancy with how we respond after f has been crossed

simon
2016-06-06 12:16
well, you have to do something

ghaskins
2016-06-06 12:16
the only thing you can really do is recognize that we have to wait until we get back to 2f+1

simon
2016-06-06 12:16
in PBFT, when the timeout expires, you do a view change

ghaskins
2016-06-06 12:16
yes, service is out (at least on your side of the partition) but that is all you can really do

simon
2016-06-06 12:17
and hope that it was the primary's fault

simon
2016-06-06 12:17
or maybe it is the network and eventually you'll sync up

ghaskins
2016-06-06 12:17
trying to do a view change in a subordinate or inferior partition will be fruitless though

simon
2016-06-06 12:17
yes

ghaskins
2016-06-06 12:17
yes, you should perhaps try, yes

simon
2016-06-06 12:17
that's on the receiving side

ghaskins
2016-06-06 12:18
but really, once >f, you are stalled

simon
2016-06-06 12:18
what i'm at is the sending side

ghaskins
2016-06-06 12:18
thats fine, i think we are talking about slightly different things now anyay

ghaskins
2016-06-06 12:18
anyway

simon
2016-06-06 12:18
currently the code waits for all sends to return

ghaskins
2016-06-06 12:19
i thought you were talking about channel timeouts, which would be a different consideration from view-change timeouts

simon
2016-06-06 12:19
which fails if a node deliberately doesn't service its connection

ghaskins
2016-06-06 12:19
yes, you need to wait for 2f+1, not all

ghaskins
2016-06-06 12:19
anyway, have to get kiddos to school, bbiab

simon
2016-06-06 12:19
ok

simon
2016-06-06 12:39
it is all difficult - even if 2f+1 nodes accept the send, there may still be f nodes that don't

simon
2016-06-06 12:42
their send may block. but that means that i might queue an unbounded amount of data to these non-responding nodes

philippe
2016-06-06 15:28
has joined #fabric-consensus-dev

jyellick
2016-06-06 16:01
@simon: I think you fell off the call

simon
2016-06-06 16:03
i did :confused:

jyellick
2016-06-06 16:10
There should be no scenario where the PBFT event thread stops reading messages off of the receive queue

jyellick
2016-06-06 16:10
Actually, there is a piece of code that really needs to be reworked, which I think could be causing those buffers to fill

jyellick
2016-06-06 16:11
Somehow, after all of the cleanup that's been done, in batch, the execution is being done on the main event thread now

jyellick
2016-06-06 16:12
So, for deploy transactions in particular, we could end up blocking for somewhat long periods of time

jyellick
2016-06-06 16:12
It's been on my TODO list to fix, maybe now is the time.

simon
2016-06-06 16:12
ooooh

simon
2016-06-06 16:13
yea

simon
2016-06-06 16:13
that with slow execution could explain both timeouts and discarded messages

jyellick
2016-06-06 16:14
Sounds like a reasonable priority for me for the moment then, I'll focus on that unless you see something else more pressing?

simon
2016-06-06 16:14
that sounds good and not too complicated

simon
2016-06-06 16:14
now with all events in place

jyellick
2016-06-06 16:14
Right

simon
2016-06-06 16:15
should i look into commit on checkpoint?

jyellick
2016-06-06 16:16
I'm a little worried we're going to be touching the same code there (since it's both in the batch execution path)

simon
2016-06-06 16:16
yes

simon
2016-06-06 16:16
that was my thought

simon
2016-06-06 16:16
i'll hold off for now

simon
2016-06-06 16:16
and think about the sending issue

jyellick
2016-06-06 16:17
Sounds good, thanks @simon!


jyellick
2016-06-06 17:33
@kostas: @tuand ^ for review as well

vukolic
2016-06-06 20:53
guys - I have second thoughts on commit on checkpoint

vukolic
2016-06-06 20:54
in the case of #1701 or #1545 - what this would do is simply block the whole network for good - since we cannot make a valid checkpoint

vukolic
2016-06-06 20:54
with that - we are not addressing the source of issues that lead to #1701 and #1545

vukolic
2016-06-06 20:55
we need to understand: 1) do they have non-deterministic chaincode

vukolic
2016-06-06 20:55
or 2) we have a bug in execution / state hash calculation

vukolic
2016-06-06 20:56
in case of 1) - my response is - we do not care and will not do anything

vukolic
2016-06-06 20:57
in case of 2) we obviously need to fix

vukolic
2016-06-06 20:57
and re - commit on checkpoints - I do not think that blocking the entire network for good is actually a sufficiently good resolution of these issues

jyellick
2016-06-06 20:58
@vukolic: In the event that there are 2f+1 matching checkpoints, any of the remaining f which disagree would believe themselves to be byzantine, and recover via state transfer

vukolic
2016-06-06 20:58
indeed but logs from JPX show all 4 peers diverging

vukolic
2016-06-06 20:58
at that point not only you cannot make a stable checkpoint

vukolic
2016-06-06 20:58
but cannot trust anyone to actually transfer state from

vukolic
2016-06-06 20:58
so we block

jyellick
2016-06-06 20:58
If they all four diverge at the same point, then yes, we would have a problem, I had not seen any logs which actually included the problem

vukolic
2016-06-06 20:59
again this may fix this run - but certainly does not fix the cause of the issue

jyellick
2016-06-06 20:59
Certainly the 'commit on checkpoint' is not a solution to non-determinism, but it would prevent inconsistent state from being committed across the network

vukolic
2016-06-06 21:00
agree with that - but it does not solve the source of the issue

jyellick
2016-06-06 21:00
Completely agree with respect to (1) and (2)

vukolic
2016-06-06 21:00
non-determinism should not be there

jyellick
2016-06-06 21:00
If (2), this is a very serious bug that _must_ be identified and fixed

jyellick
2016-06-06 21:00
If (1), they should fix, but, committing on checkpoint would prevent a client from querying, and receiving a value which ultimately may not be committed

vukolic
2016-06-06 21:01
anyway - it would be great to make that (commit on checkpoint) configurable

vukolic
2016-06-06 21:01
so we can actually configure the "normal" pbft and "paranoid" pbft

jyellick
2016-06-06 21:02
Yes, @simon and I discussed this some on the phone, we should be able to make this configurable

vukolic
2016-06-06 21:02
ok - but again we need to also look at the source of the issue

vukolic
2016-06-06 21:02
do we have the chaincode?

jyellick
2016-06-06 21:02
I have not seen the logs to which you're referring, last I checked, I found the logs were not complete to the point they described the problem (and only for 2 nodes)

jyellick
2016-06-06 21:03
And we were missing chaincode logs, but I could be out of date

vukolic
2016-06-06 21:03
this is what I saw as well - but we need actual chaincode to look for non-determinism

vukolic
2016-06-06 21:03
we do not have that right?

tuand
2016-06-06 21:03
barry/mihir are replicating the environment ... should have more logs in the morning

tuand
2016-06-06 21:03
and chaincodes

vukolic
2016-06-06 21:03
ok then - pls post here when you have the chaincodes

vukolic
2016-06-06 21:03
BTW - as a general rule

vukolic
2016-06-06 21:04
we must tell folks to submit chaincode + logs

vukolic
2016-06-06 21:04
otherwise no bug of this kind is reproducible

tuand
2016-06-06 21:04
+1

vukolic
2016-06-06 21:04
i meant - do not post chaincode here :slightly_smiling_face:

vukolic
2016-06-06 21:04
but post a notice

vukolic
2016-06-06 21:04
thks!

jyellick
2016-06-06 21:05
@vukolic: I am not an expert with the chaincodes, but I believe for many, this is where their 'application' which is being built on top of the fabric lives, I am not certain everyone will be happy/willing to share this

jyellick
2016-06-06 21:05
But when available, certainly it could be helpful

vukolic
2016-06-06 21:05
ack but then it is easy for us - we offer no service :slightly_smiling_face:

vukolic
2016-06-06 21:06
until there is a chaincode utility with which they can provably show it is deterministic :slightly_smiling_face:

vukolic
2016-06-06 21:06
in zero knowledge

yingfeng
2016-06-07 09:57
How do I configure pbft as the consensus plugin of a peer node? This is my config:
```
# Validator defines whether this peer is a validating peer or not, and if
# it is enabled, what consensus plugin to load
validator:
  enabled: true
  consensus:
    # Consensus plugin to use. The value is the name of the plugin, e.g. pbft, noops (this value is case-insensitive)
    # if the given value is not recognized, we will default to noops
    plugin: batch
    # total number of consensus messages which will be buffered per connection before delivery is rejected
    buffersize: 1000
```
but I still see output showing a `noops` plugin is created:
```
09:46:45.612 [consensus/statetransfer] blockThread -> DEBU 02b name:"vp0" has validated its blockchain to the genesis block
09:46:45.612 [consensus/noops] newNoops -> INFO 02c NOOPS consensus type = *noops.Noops
09:46:45.612 [consensus/noops] newNoops -> INFO 02d NOOPS block size = 500
09:46:45.612 [consensus/noops] newNoops -> INFO 02e NOOPS block timeout = 1s
```

simon
2016-06-07 09:58
you set the consensus type in the core.yaml file

yingfeng
2016-06-07 09:58
@simon: `plugin: batch` is configured in the core.yaml file

simon
2016-06-07 09:59
hmm

simon
2016-06-07 09:59
ah no, plugin: obcpbft

simon
2016-06-07 09:59
i think

yingfeng
2016-06-07 10:00
...

kostas
2016-06-07 10:00
`plugin: pbft`


yingfeng
2016-06-07 10:01
@kostas: got it, thank you

simon
2016-06-07 10:10
that should definitely be in a comment

simon
2016-06-07 10:10
what options you have

yingfeng
2016-06-07 11:00
When I use `plugin: pbft` and set up the p2p network, I can see the classic pbft plugin working on each node. However, when I send a chaincode deploy json to them, why are the corresponding docker images not created anymore? And I get a failure when I send a query request. This is the output of `docker images`:
```
REPOSITORY                     TAG      IMAGE ID       CREATED          SIZE
hyperledger-peer               latest   aad903f81518   52 minutes ago   2.066 GB
hyperledger/fabric-baseimage   latest   5d4fe4b975c6   4 days ago       1.384 GB
```
No container with the name `dev-vp0-7b07c59e9b9405c1aef33493b63b9a766d9bb836989ded1730052de650aa8ce5654274d148ceff96a4e5bd43bca26aba099f55c400e4befdc8b2ee4c0a94e30b` is ever created anymore.

simon
2016-06-07 11:01
i don't know

simon
2016-06-07 11:02
without logs i cannot help


simon
2016-06-07 11:07
are you running 4 nodes?

yingfeng
2016-06-07 11:07
yes

yingfeng
2016-06-07 11:07
this is the output of another node which accept the `deploy` request https://transfer.sh/GDJ3w/peer.dbg.out

simon
2016-06-07 11:12
what version of the code are you running?

yingfeng
2016-06-07 11:13
latest code with commit of `803e594489dcd011970d168403ee328be0b3da3a`

simon
2016-06-07 11:16
i don't have that

simon
2016-06-07 11:16
ah no

simon
2016-06-07 11:16
weird

simon
2016-06-07 11:17
ah, try pbft batch

yingfeng
2016-06-07 11:19
ok

yingfeng
2016-06-07 11:36
still the same.. the config is:
```
mode: batch
```
in the `consensus/obcpbft/config.yaml`, and the output log is too huge to be uploaded.. there are many chaotic outputs such as:
```
2\246\013\322\350,D\321\251\350l\366\025\343\347\003[\013\310\242\261\270\352t\023\335\255C\233\000(\315YZU\252\361\020\277\030\354\346\363$\240\266\334k\233\361\324\250oW\307\204\200\\\002h\334;J \021\306G\214\366\211^\261r\032\326\214@\322o\"t\033\374X`\036\001['\243\200sj\245\254r\327B(%\302\363(\263~*F\331,p\200\002\333+#\034\016!2\217\030\311\352\010\261J.\311\237\376\365\021\342\260p:\002\324\177c\224s\t\013\220#\202\363\277\242\220\262\232C\350F\306\003@.\204\305Qb\
```

simon
2016-06-07 11:45
yea i don't know what that is

simon
2016-06-07 11:46
some code paths log content

simon
2016-06-07 11:46
which is unfortunate

simon
2016-06-07 11:46
without log i cannot help at all

yingfeng
2016-06-07 11:56
I cut the head 1800 lines of the log file such that those chaotic outputs are not included: https://transfer.sh/JTA0/peer.out the overall log is above 10G .. I just send a deploy request

simon
2016-06-07 12:06
your deploy transaction is huge

simon
2016-06-07 12:06
200MB

simon
2016-06-07 12:06
and your network needs 5s to send that transaction

simon
2016-06-07 12:06
at that point, consensus decides that something is wrong

simon
2016-06-07 12:07
you will have to adjust your timeouts

simon
2016-06-07 12:07
or reduce your transaction size

ghaskins
2016-06-07 12:07
we have patches in the works that may help on that front

ghaskins
2016-06-07 12:07
(by substantially reducing cruft that makes it into the deploy payload today)

simon
2016-06-07 12:07
it's all a hack

ghaskins
2016-06-07 12:07
assuming that is the problem

ghaskins
2016-06-07 12:07
what is?

simon
2016-06-07 12:08
the patches to only include certain types of files

ghaskins
2016-06-07 12:08
well, maybe, but its a stopgap anyway

ghaskins
2016-06-07 12:09
the right way to do it is to ask the chaincode for its list of dependencies, that will go in next

ghaskins
2016-06-07 12:09
but, the hack gets us most of the way there, so if you need a quick fix for the 200MB

simon
2016-06-07 12:09
and move the shim dependency out of the main fabric repo

ghaskins
2016-06-07 12:09
that actually doesnt matter any more

simon
2016-06-07 12:09
how does it work then?

yingfeng
2016-06-07 12:10
@simon I just deploy the chaincode example2 ..

simon
2016-06-07 12:10
yingfeng: probably you have some large log files around

ghaskins
2016-06-07 12:10
@simon: see the comment about half-way down in PR 1720 https://github.com/hyperledger/fabric/pull/1720

ghaskins
2016-06-07 12:11
regarding gofiles.sh

ghaskins
2016-06-07 12:11
there are techniques to resolve the deps (direct and transitive) for any package

ghaskins
2016-06-07 12:11
and from there, there are techniques to ask any package for the files it includes

ghaskins
2016-06-07 12:12
so, you can simply ask a package such as a chaincode for the complete set of packages/files it needs, doesnt matter where they live

ghaskins
2016-06-07 12:12
that is the direction both SDK/NVP for GOLANG and chaintool for CAR need to go

ghaskins
2016-06-07 12:12
(IMO)

ghaskins
2016-06-07 12:13
but in the meantime, the file exclusions work perfectly well, and will substantially reduce the payload size

simon
2016-06-07 12:13
ok

yingfeng
2016-06-07 12:17
@simon This is the code directory of my machine http://pastebin.com/km02KA2P , I build docker image using this directory, there does not exist a log file here..

simon
2016-06-07 12:21
319M fabric/

yingfeng
2016-06-07 12:23
yes, there are some binaries, they are peer and chaincode programs

simon
2016-06-07 12:24
yes

simon
2016-06-07 12:24
that all gets packaged up

ghaskins
2016-06-07 12:25
@yingfeng: if you’d like, you can try running on top of the PR 1708 branch and I think that problem will be mitigated

ghaskins
2016-06-07 12:25
1720 is the more direct/short-term fix, but it needs more work to be ready to work for all cases

ghaskins
2016-06-07 12:25
1708 at least passes CI currently

ghaskins
2016-06-07 12:26
either way, 1708, 1720, or (most likely) an amalgam will be merged asap

yingfeng
2016-06-07 12:26
got it, thanks

sachikoy
2016-06-07 13:22
@vukolic I sent you the chaincode by e-mail.

simon
2016-06-07 13:22
hi sachikoy

sachikoy
2016-06-07 13:25
hi

simon
2016-06-07 13:41
do you have debug logs of the failures in 1545, 1331?

sachikoy
2016-06-07 14:13
here is the logs for 1545, for vp0 an vp1 https://ibm.box.com/s/l9i37p4ex5or44iy3ojbva61d6lzrx43

simon
2016-06-07 14:18
are these new?

simon
2016-06-07 14:19
because the ones i saw yesterday are incomplete

simon
2016-06-07 14:44

sachikoy
2016-06-07 14:45
vp1’s log is incomplete because I lost network connection while downloading the log

jyellick
2016-06-07 14:53
@simon: Interesting. I don't think that entirely sells me on eliminating channels, but it does reaffirm my intuition that channels and mutexes do not mix nicely

simon
2016-06-07 14:53
sachikoy: without full logs, ideally from all peers, we can't really see what is going on

jyellick
2016-06-07 15:03
@simon @kostas @tuand https://github.com/hyperledger/fabric/pull/1744 this is a simple code rename refactor, if you have a chance to glance at and sign off

tuand
2016-06-07 15:31
so for #1331, what should happen when i do an invoke to a peer that's just been re-started ?

jyellick
2016-06-07 15:59
Depends on how 'just' restarted, and the state of the network

jyellick
2016-06-07 16:00
This all assumes under batch, but if the network has not changed views since the peer went down and back up, then it should forward the request to the current (correct) primary, and it should be ordered by the network

jyellick
2016-06-07 16:00
This is regardless of the current replica's ability to participate in ordering.

jyellick
2016-06-07 16:00
If the peer is more out of sync with the network, it may have to wait until it can eavesdrop into the correct watermarks, and for a view change potentially to pick the correct view

jyellick
2016-06-07 16:01
But eventually, the request should be processed

tuand
2016-06-07 16:05
that's what i expected ... from reading @ratnakar 's latest logs, after peer3 is restarted, the subsequent requests are being forwarded and executed but peer3 isn't starting state transfer ... rechecking now

simon
2016-06-07 16:07
actually when custody runs out, it will broadcast the complaint, and then the system should commit the transaction

simon
2016-06-07 16:08
testing async functions is such a pain

simon
2016-06-07 16:09
i don't understand how people usually test these things

simon
2016-06-07 16:31
jyellick: do we ever move watermarks without being up to date?

jyellick
2016-06-07 16:32
Yes

jyellick
2016-06-07 16:33
@simon When we first detect we are out of date, we move our watermarks to where the network seems to be operating, this is so that we can collect a weak checkpoint certificate so that we can initiate state transfer. It also allows us to buffer new transactions while state transfer takes place.

simon
2016-06-07 16:34
so at that point we start updating p and q sets

jyellick
2016-06-07 16:35
(As I've mentioned before, to not grow unboundedly, we only track 1 checkpoint per peer above our watermarks. Often, we might get a weak cert of matching checkpoints, but it is not guaranteed, we might get f+1 checkpoints for different sequence numbers, or with non-matching hashes, that simply means we are out of date and must listen for a good target)

jyellick
2016-06-07 16:35
Yes, I think that's correct

simon
2016-06-07 16:35
so updating lastExec to the watermark is incorrect?

jyellick
2016-06-07 16:36
Right

jyellick
2016-06-07 16:36
Actually, I think that's a bug in the view change code

jyellick
2016-06-07 16:37
```
if instance.lastExec < cp.SequenceNumber {
	logger.Warning("Replica %d missing base checkpoint %d (%s)", instance.id, cp.SequenceNumber, cp.Id)
	snapshotID, err := base64.StdEncoding.DecodeString(cp.Id)
	if nil != err {
		err = fmt.Errorf("Replica %d received a view change who's hash could not be decoded (%s)", instance.id, cp.Id)
		logger.Error(err.Error())
		return nil
	}
	instance.consumer.skipTo(cp.SequenceNumber, snapshotID, replicas)
	instance.lastExec = cp.SequenceNumber
}
```

jyellick
2016-06-07 16:37
That `lastExec` should be set once the state transfer completes, not at initiation

simon
2016-06-07 16:37
hmm

jyellick
2016-06-07 16:37
(It used to be correct, under the old executor model)

simon
2016-06-07 16:38
right

simon
2016-06-07 16:39
because state transfer may sync to a different seqno

jyellick
2016-06-07 16:40
Well, and additionally, I don't see anything there that would keep us from executing transactions before the state transfer completes, (which would temporarily corrupt the blockchain, and actually potentially permanently cause divergence )

simon
2016-06-07 16:41
but we're set to "syncing"

jyellick
2016-06-07 16:41
Where?

simon
2016-06-07 16:41
```
if instance.skipInProgress {
	logger.Debug("Replica %d currently picking a starting point to resume, will not execute", instance.id)
	return false
}
```

jyellick
2016-06-07 16:41
Right, but in the view change, where do we set `instance.skipInProgress`?

simon
2016-06-07 16:42
aha!

simon
2016-06-07 16:42
view change

simon
2016-06-07 16:42
also

simon
2016-06-07 16:42
what happens if we are executing

simon
2016-06-07 16:43
and we trigger a state transfer

simon
2016-06-07 16:43
they will execute concurrently

jyellick
2016-06-07 16:43
That does seem like a potential race

jyellick
2016-06-07 16:43
This is why the executor combined the two

simon
2016-06-07 16:44
maybe now it is time to build a small executor using the events system

jyellick
2016-06-07 16:44
I wonder if it doesn't belong outside of obcpbft, like state transfer

simon
2016-06-07 16:44
and state transfer can probably also unwind by using events

simon
2016-06-07 16:44
yea it does

jyellick
2016-06-07 16:46
I am trying to remember, what execution path work was it you were going to tackle?

simon
2016-06-07 16:47
the preview exec + commit on checkpoint?

jyellick
2016-06-07 16:48
Ah, yes, that's right

jyellick
2016-06-07 16:49
Wondering which work should be done first

simon
2016-06-07 16:51
bug fix over features

jyellick
2016-06-07 16:53
Fair enough. As I know the state transfer semantics well, want me to tackle the mini-executor then?

simon
2016-06-07 16:59
and with it the commit on viewchange race?

jyellick
2016-06-07 17:02
Right

simon
2016-06-07 17:08
okay

jyellick
2016-06-07 17:23
@simon: Are you still handy?

simon
2016-06-07 17:23
i am

jyellick
2016-06-07 17:23
Designing the API for the executor

jyellick
2016-06-07 17:23
Since the executor must necessarily have its own thread for performing executions (and we don't want to block until they complete)

jyellick
2016-06-07 17:24
I was trying to decide how we should pass back the result to a call like `Preview` or `Commit`

jyellick
2016-06-07 17:25
The most simple/direct approach is to supply a callback which will be invoked with the return value, but it's not the most intuitive of APIs

jyellick
2016-06-07 17:25
Something like: `Preview(callback func(uint64, *pb.Block))` or `Execute(txs []*pb.Transaction, callback func())`

jyellick
2016-06-07 17:25
Thoughts?

simon
2016-06-07 17:26
yea, thinking

simon
2016-06-07 17:27
would it be enough for it to emit an event?

simon
2016-06-07 17:27
i suppose for decoupling that would be a callback

jyellick
2016-06-07 17:28
Right, ultimately it will be converted back to an event

jyellick
2016-06-07 17:29
The other option would be to specify a callback receiver at instantiation

jyellick
2016-06-07 17:29
That would make the API usage a little more straightforward I would think, though it would reduce the flexibility a bit.

simon
2016-06-07 17:29
i think supplying an interface is more idiomatic?

simon
2016-06-07 17:29
on instantiation

jyellick
2016-06-07 17:30
Okay, that's fine with me

simon
2016-06-07 17:30
but then you have the problem with concurrent bringup

jyellick
2016-06-07 17:30
It does make instantiation a little annoying, but, because there's a `Start()`, it should be safe

simon
2016-06-07 17:32
i wonder what the idiomatic way is

jyellick
2016-06-07 17:34
As do I, seems like there must be a better pattern

jyellick
2016-06-07 18:13
@simon: Still around?

simon
2016-06-07 18:15
yes

jyellick
2016-06-07 18:17
So, I've tracked down the problem @bcbrock has been having with 'empty' blocks. Basically, in the execution loop, we mark all of the requests as stale, and end up with a slice of transactions which is 0 length.

jyellick
2016-06-07 18:18
(In this case, there was only 1 transaction to begin with)

simon
2016-06-07 18:18
oh

simon
2016-06-07 18:18
how come

jyellick
2016-06-07 18:18
So, I'm wondering two things, one, what should we do if there are no valid transactions in a batch? I would think we should not write a block.

simon
2016-06-07 18:19
why did the primary include stale requests?

jyellick
2016-06-07 18:19
And secondly, it seems like the primary including stale transactions is a bug

simon
2016-06-07 18:19
probably should trigger view change

jyellick
2016-06-07 18:19
Right, as it would indicate byzantine behavior

simon
2016-06-07 18:20
primary should sort requests by the same sender

jyellick
2016-06-07 18:20
So, I think the way this is happening...

jyellick
2016-06-07 18:20
Is that the primary is receiving requests from REST with out of order timestamps

jyellick
2016-06-07 18:21
Because if there are concurrent requests to be delivered into consensus, it is not FIFO, it is pseudorandom (per standard channel writer behavior)

jyellick
2016-06-07 18:22
Or wait, maybe not...

jyellick
2016-06-07 18:22
We make a new timestamp it looks like

simon
2016-06-07 18:43
yes we do

jyellick
2016-06-07 18:50
The leader is receiving stale requests from itself

cca
2016-06-07 18:53
hi guys - i had this open and was reading

cca
2016-06-07 18:53
FIFO is not required for BFT... but if it is missing, clients often wonder what happens

cca
2016-06-07 18:53
and even write papers .... https://arxiv.org/abs/1605.05438 ...

cca
2016-06-07 18:54
so if you can support it without much cost, then it makes a lot of sense

simon
2016-06-07 18:54
hi cca

jyellick
2016-06-07 18:58
I think even if we were to not use the pseudorandom channel writer stuff, FIFO would still be difficult to promise, as requests can come in concurrently

simon
2016-06-07 18:59
cca: i think these guys didn't read the nakamoto paper?

simon
2016-06-07 19:00
just because a block has been mined doesn't mean that the data should be considered "committed"

cca
2016-06-07 19:06
FIFO is defined as per-sender order

cca
2016-06-07 19:06
can be implemented in the obvious way, with a sequence number stored at the sender (without a concept of a sender, it isn't defined)

cca
2016-06-07 19:08
simon: they go into the depth of the chain, and parameterize what is decided by the depth. you have to make some choice like this, in nakamoto consensus -- otherwise, if i buy my house using bitcoin, then the chain reverts and forks to something else, do i have to move out?

jyellick
2016-06-07 19:17
@simon: Are you seeing how the primary could be submitting stale requests to itself?

jyellick
2016-06-07 19:33
Aha! Found it

vukolic
2016-06-07 20:01
re commit on checkpoints as 1545 turns out non-deterministic and 1701 bogus use of noops - I am more convinced that we should not do it

vukolic
2016-06-07 20:01
there is no use in masking non-determinism (sometimes)

vukolic
2016-06-07 20:04
that said - we desperately need to help chaincode developers not write non-deterministic chaincode - and I am not aware that anybody is looking into how to do this

jyellick
2016-06-07 20:10
@vukolic: I think the converse argument would be that committing on checkpoint would generally reveal nondeterminism more immediately, the blockchain should halt in a consistent state, and the set of transactions which caused the non-determinism would be obvious

vukolic
2016-06-07 21:02
actually it wouldn't

vukolic
2016-06-07 21:02
if non determinism appears in one peer then it could be masked

vukolic
2016-06-07 21:02
anyway - for the record - I am not in support of that

vukolic
2016-06-07 21:04
one can have at a checkpoint detection that we diverged from others

vukolic
2016-06-07 21:04
and shut down the machine - that would be ok

vukolic
2016-06-07 21:05
I am more in favor of debugging Sieve and/or moving to Cons v2 to address non-determinism

vukolic
2016-06-07 21:05
addressing it incompletely is not satisfactory IMO

jyellick
2016-06-08 03:20
Yes, for non-determinism which is exhibited in less than f peers, it would not be detected, but under byzantine conditions, I believe it is provably not solvable (including under Sieve).

jyellick
2016-06-08 03:23
@simon @tuand @kostas https://github.com/hyperledger/fabric/pull/1749 This should fix the empty blocks that @bcbrock has been observing under batch. Essentially, when `pbft-core.go` runs out of sequence numbers, it begins buffering requests in a map, and when the watermarks move, it resubmits the requests in map iterator order (which is effectively random). Because the deduplicator filters out 'old' requests, we end up with blocks which contain no transactions (and end up abandoning requests which we should not).

yingfeng
2016-06-08 04:46
@ghaskins: I've applied the patch from PR 1708, and now, on each machine, here are the results of `docker images`:
```
REPOSITORY                     TAG      IMAGE ID       CREATED          SIZE
hyperledger/fabric-peer        latest   820b0b3235d8   21 minutes ago   1.443 GB
hyperledger/fabric-ccenv       latest   87324f87a686   21 minutes ago   1.433 GB
hyperledger/fabric-src         latest   f68c00309ee5   21 minutes ago   1.416 GB
hyperledger/fabric-baseimage   latest   43574ff03f31   43 minutes ago   1.384 GB
```
Now I start 4 peer nodes on 4 different machines, with the pbft consensus configuration (`plugin: pbft` in `peer/core.yaml` and `mode: batch` in `consensus/obcpbft/config.yaml`). The deployment behavior is different from before: it returns immediately. I still use the chaincode example2 as the deployment test. However, after deployment, queries fail:
```
curl -H "Content-Type: application/json" -X POST --data "@query.json" -k http://192.168.0.147:5000/chaincode
{"jsonrpc":"2.0","error":{"code":-32003,"message":"Query failure","data":"Error when querying chaincode: Error:Failed to launch chaincode spec(Could not get deployment transaction for 7b07c59e9b9405c1aef33493b63b9a766d9bb836989ded1730052de650aa8ce5654274d148ceff96a4e5bd43bca26aba099f55c400e4befdc8b2ee4c0a94e30b - LedgerError - ResourceNotFound: ledger: resource not found)"},"id":5}
```
The logs of the 4 machines are uploaded here: https://transfer.sh/15kVnV/logs.tar.gz

simon
2016-06-08 11:29
yingfeng: they create containers

simon
2016-06-08 11:29
certainly not a consensus issue

yingfeng
2016-06-08 11:46
@simon: then what does it mean? is it an extra bug?

simon
2016-06-08 11:46
i have no idea, i can't see the problem in the log

simon
2016-06-08 11:47
wait

simon
2016-06-08 11:47
are you using the wrong chaincode id?

yingfeng
2016-06-08 11:48
ah you're right.. it seems a different chaincode id is returned when applying PR 1708?

simon
2016-06-08 11:50
no idea

jyellick
2016-06-08 15:38
@simon: What is the correct custodial reaction to state transfer?

simon
2016-06-08 15:47
throw away everything, i think

simon
2016-06-08 15:47
or re-introduce them and rely on some other subsystem to filter replays

simon
2016-06-08 15:47
but that subsystem doesn't exist

simon
2016-06-08 15:54
jyellick: how come 1741 is happening?

simon
2016-06-08 15:54
shouldn't we have observed this behavior previously?

jyellick
2016-06-08 15:56
@simon: I hadn't been following 1741 too closely, have been trying to figure out the other half of 1091

jyellick
2016-06-08 15:58
Looking at the sequence numbers, it's very suspicious that the network stalls at seqNo=20

simon
2016-06-08 15:58
well yes

simon
2016-06-08 15:59
it should get a checkpoint

simon
2016-06-08 15:59
but without debug info, i can't tell whether nodes are trying to send checkpoints, etc.

jyellick
2016-06-08 15:59
But yes, I've successfully run tens of thousands of transactions through on the defaults

jyellick
2016-06-08 15:59
I don't see why we would suddenly have no checkpoints

simon
2016-06-08 16:00
default is K=40, L=10?

simon
2016-06-08 16:01
vp3 is not working properly, but that should be fine

jyellick
2016-06-08 16:05
K=10, L=40, but primary will not order requests beyond L/2 to prevent thrashing

simon
2016-06-08 16:11
yea

simon
2016-06-08 16:12
i don't understand why the primary doesn't seem to get checkpoints and then continue

simon
2016-06-08 16:14
so i have a first code for the broadcaster, but it is so ugly


crow15
2016-06-08 16:20
has joined #fabric-consensus-dev

plucena
2016-06-08 17:36
Hi there. Is there any plan to support any proof of stake algorithm on Hyperledger?


plucena
2016-06-08 18:18
Thanks a lot @tuand

simon
2016-06-09 13:28
jyellick: i've been thinking about the age filtering

simon
2016-06-09 13:28
jyellick: i think we shouldn't filter on execute

simon
2016-06-09 13:29
there is a chance that we are not synced (because we did state transfer), and will not reject a transaction that others will reject

simon
2016-06-09 13:29
jyellick: probably we need to reject the block on prepare

jyellick
2016-06-09 13:29
Yes, I was just about to suggest that

jyellick
2016-06-09 13:30
It should be easy enough to do too in the context of the new events stuff, simply have batch catch the PrePrepare and filter it out if it is about a stale request

simon
2016-06-09 13:31
and trigger view change

simon
2016-06-09 13:31
yea

simon
2016-06-09 13:32
i'm still on 1741 - something is making the system non-deterministic, and i don't know what

simon
2016-06-09 13:32
what could it be?

jyellick
2016-06-09 13:33
Let me grab the logs and take a look

simon
2016-06-09 13:33
well, there are no buckettree/ledger infos

jyellick
2016-06-09 13:33
My best guess is, since this is a system no-op chaincode, that it has something to do with the exec go routine returning faster than usual

jyellick
2016-06-09 13:34
(Looking at the logs now)

simon
2016-06-09 13:34
unfortunately you can only see the divergence at the checkpoint

jyellick
2016-06-09 13:38
All of the sequence number and execute digests match

simon
2016-06-09 13:39
yep

simon
2016-06-09 13:39
super weird

jyellick
2016-06-09 13:39
I'd say 'non-deterministic chaincode'

jyellick
2016-06-09 13:39
But could also be some sort of ledger bug, not sure why there are no logs for it

jyellick
2016-06-09 13:40
But I'd agree, there's nothing that looks like it's going 'wrong' in PBFT

simon
2016-06-09 13:41
the chaincode doesn't do anything

simon
2016-06-09 13:41
so that can't be it either

jyellick
2016-06-09 13:44
Just commented on that issue in support

simon
2016-06-09 17:53
yeah figured out 1741 - not our fault

tuand
2016-06-09 17:57
just saw your #1741 comment :+1:

muralisr
2016-06-09 18:02
I had @manish-sethi question too

vukolic
2016-06-10 09:51
so #1741 = non-det genesis blocks?

vukolic
2016-06-10 09:51
because of timestamps?

simon
2016-06-10 09:56
yep

ghaskins
2016-06-10 11:03
is the thought that each node will generate its own genesis block?

simon
2016-06-10 11:15
that or for provisioning, the genesis block is created in one place and imported on all nodes, i'd say

simon
2016-06-10 11:15
but i don't think there is any plan for that

simon
2016-06-10 11:15
i don't think anybody even thought about this

ghaskins
2016-06-10 11:15
that's how it's done for most blockchains, and I assumed it would be the case here too

ghaskins
2016-06-10 11:16
personally, i think its fine to assume that…so if that simplifies anything w.r.t. 1741...

simon
2016-06-10 11:16
i'd say it is a requirement

simon
2016-06-10 11:17
and all settings need to be driven from that genesis block

simon
2016-06-10 11:17
all settings that influence chain creation

ghaskins
2016-06-10 11:17
totally agree

ghaskins
2016-06-10 11:18
though it might be nice to properly delineate settings that influence genesis block creation and settings that are used dynamically over the lifetime of the system

ghaskins
2016-06-10 11:18
(if there are any)

ghaskins
2016-06-10 11:19
im thinking something like the seed/root nodes needs to be dynamic minimally

simon
2016-06-10 11:19
i'd say the genesis block shouldn't be created implicitly

simon
2016-06-10 11:19
but explicitly, and imported explicitly

ghaskins
2016-06-10 11:19
agreed, it should be an explicit operation

ghaskins
2016-06-10 11:19
yep

ghaskins
2016-06-10 11:21
i havent looked, but i am assuming that there isnt currently an external representation of a block?

ghaskins
2016-06-10 11:22
(i.e. the system detects that the db is empty and emits a genesis block straight to the ledger)


muralisr
2016-06-10 12:22
@simon @ghaskins with the Lifecycle work we were considering not having the system cc deploy transaction on the ledger at all. (especially for “upgrade” issues)

muralisr
2016-06-10 12:23
do you think - at least for now - we should go ahead with that plan ? Basically treat system chaincode the same way we would treat the fabric itself

simon
2016-06-10 12:24
i don't know what you mean

simon
2016-06-10 12:24
what is the terminology?

simon
2016-06-10 12:24
ledger is blockchain?

simon
2016-06-10 12:24
or ledger is blocks?

simon
2016-06-10 12:24
or state?

muralisr
2016-06-10 12:26
system chaincode deploy transaction does not go through consensus. Each peer brings it up outside of consensus. So we can take the next step and not write the deploy transaction on the genesis block

muralisr
2016-06-10 12:28
by doing that we are treating the system chaincode as part of the fabric in some sense. So “upgrade” of sys cc would have the same considerations as upgrade of the fabric itself

simon
2016-06-10 12:29
but how do you invoke system chaincode then?

simon
2016-06-10 12:29
how do you know it is enabled?

muralisr
2016-06-10 12:30
invoke will work or fail depending on whether the sys cc is installed or not, no?

muralisr
2016-06-10 12:31
I have a branch where I was playing with the sys cc not being on the block

simon
2016-06-10 12:33
what determines whether it is registered or not?

muralisr
2016-06-10 12:39
it is registered if its in the core.yaml and hooked up via code

muralisr
2016-06-10 12:39
and if it is not, invokes and queries will fail

simon
2016-06-10 12:44
yea no

simon
2016-06-10 12:44
we need to make sure that all replicas are the same

simon
2016-06-10 12:45
it's fine as a prototype to have it in core.yaml

simon
2016-06-10 12:45
but for production, this needs to come from the ledger itself

simon
2016-06-10 12:45
which chaincode is enabled

simon
2016-06-10 12:45
otherwise different peers might run different chaincode

muralisr
2016-06-10 12:48
yea, I can see how that might be a good separation.

muralisr
2016-06-10 12:49
that’s a different problem from having the dep. transaction on the block

simon
2016-06-10 12:49
so maybe we don't need a deploy transaction

muralisr
2016-06-10 12:49
yeah

simon
2016-06-10 12:49
but we need some form of registration

muralisr
2016-06-10 12:49
yea

simon
2016-06-10 12:49
didn't we want to move chaincode into the state?

muralisr
2016-06-10 12:50
yes

muralisr
2016-06-10 12:50
the dep transaction

muralisr
2016-06-10 12:51
@manish-sethi already did that work when we played with the things needed for “life cycle"

muralisr
2016-06-10 12:51
(of course not in the main branch)

simon
2016-06-10 12:52
i don't quite understand how we can remove the data from the deploy transaction

simon
2016-06-10 12:53
i guess it would be both in the transaction, and in the state

jyellick
2016-06-10 13:29

hgabor
2016-06-10 15:10
has joined #fabric-consensus-dev

simon
2016-06-10 17:11
jyellick: you around?

jyellick
2016-06-10 17:12
I am

jyellick
2016-06-10 17:12
(@simon)

simon
2016-06-10 17:12
hi

simon
2016-06-10 17:12
so i'm trying to test the broadcaster stuff

simon
2016-06-10 17:12
and something is odd with state transfer

simon
2016-06-10 17:13
it transfers to a point, and it has entries for the rest in its certstore, but i think it is missing a commit cert

simon
2016-06-10 17:13
and then it just sits there

simon
2016-06-10 17:13
sending prepares and commits

simon
2016-06-10 17:13
but unable to execute

simon
2016-06-10 17:14
not sure whether this is expected or not

jyellick
2016-06-10 17:15
Hmm, so, I suppose it is possible that we request transfer to a point, but we've already missed some messages and will basically need to wait for state transfer to trigger again

simon
2016-06-10 17:15
yea i guess that would be it

jyellick
2016-06-10 17:15
I'm not really sure how to avoid that. We pre-emptively move our watermarks before picking a point to state transfer to, in the hope of capturing all needed messages

jyellick
2016-06-10 17:17
But that's really just best effort; I don't think there's any way to guarantee that we've not missed any messages. In practice, I've never seen it fail. Where are you seeing this?

simon
2016-06-10 17:18
pausing a peer

simon
2016-06-10 17:18
and then unpausing

simon
2016-06-10 17:18
a subset of messages replays from tcp buffers

simon
2016-06-10 17:18
and other buffers

jyellick
2016-06-10 17:21
I think this is expected then

jyellick
2016-06-10 17:25
I am trying to fix up complaints, because perceived-stale requests end up getting dropped, and it is causing problems for us. I am trying to move the deduplicator `Execute` check to filter out pre-prepares with stale requests, but I'm not sure how to reset the deduplicator on view change, as it is valid to pre-prepare the same request multiple times across multiple views.
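
[editor's note] The reset-on-view-change difficulty can be made concrete with a small sketch: a deduplicator that remembers the latest view in which each request digest was pre-prepared, so a repeat within the same view is filtered while a re-pre-prepare in a later view is allowed. All names here are illustrative, not the actual fabric code.

```go
package main

import "fmt"

// dedup tracks, per request digest, the latest view in which it was
// pre-prepared. A duplicate within the same (or an older) view is
// rejected, but the same request may legally reappear in a later view.
type dedup struct {
	seen     map[string]bool
	lastView map[string]uint64
}

func newDedup() *dedup {
	return &dedup{seen: make(map[string]bool), lastView: make(map[string]uint64)}
}

// accept reports whether a pre-prepare for digest in view should be processed.
func (d *dedup) accept(digest string, view uint64) bool {
	if d.seen[digest] && d.lastView[digest] >= view {
		return false // duplicate in the same or an earlier view
	}
	d.seen[digest] = true
	d.lastView[digest] = view
	return true
}

func main() {
	d := newDedup()
	fmt.Println(d.accept("req1", 0)) // first pre-prepare in view 0: true
	fmt.Println(d.accept("req1", 0)) // repeat in the same view: false
	fmt.Println(d.accept("req1", 1)) // same request in a later view: true
}
```

With this shape there is nothing to "reset" on view change: the view number itself keys the acceptance decision.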

simon
2016-06-10 17:27
yea

simon
2016-06-10 17:27
well

simon
2016-06-10 17:27
i've been there several times, it is difficult

simon
2016-06-10 17:31
oh it seems that the primary isn't complaining to itself, so it will never send a complaint view change

simon
2016-06-10 17:32
`17:19:56.433 [consensus/obcpbft] processMessage -> ERRO 2456 Unknown request: request:<timestamp:<seconds:1465579186 nanos:339740631 > payload:...`

simon
2016-06-10 17:32
wut

jyellick
2016-06-10 17:35
Yes, I've seen this....

simon
2016-06-10 17:37
`17:20:12.097 [consensus/handler] SkipTo -> WARN 1104 State transfer is being called for, but the state has not been invalidated`

jyellick
2016-06-10 17:38
Is this in view change?

jyellick
2016-06-10 17:40
Yes, it looks like the view change state transfer code was not updated to mark the state as invalid, should be an easy one line fix

jyellick
2016-06-10 17:41
All this code needs a serious overhaul, trying to minimize diffs by leaving in old interfaces is causing lots of cruft to build up

jyellick
2016-06-10 17:49
Oh, got it

jyellick
2016-06-10 17:50
Those `Unknown request` messages are benign

jyellick
2016-06-10 17:50
If we're not the leader, we ignore the `Request` message, and fall through to the end

jyellick
2016-06-10 17:50
I'll fix

jyellick
2016-06-10 17:58
(basically when one replica is in the wrong view, and thinks a backup is the primary, the backup will spew those messages)

simon
2016-06-10 18:08
aha

simon
2016-06-10 19:21
oh now i'm running into stale requests -_-

jyellick
2016-06-10 19:45
Yeah... am trying to clean that up

jyellick
2016-06-10 19:46
Wonder if we shouldn't just broadcast new requests into the network (not just to the primary), and count on the fact that since we have a periodic view change, they will all eventually be executed

jyellick
2016-06-10 19:46
This deduplication complaining stuff is complicated

simon
2016-06-10 19:52
yea

simon
2016-06-10 19:53
fine with me
sbrakev
2016-06-10 21:04
has joined #fabric-consensus-dev

jyellick
2016-06-10 23:08
Tagged you in it, but here is a PR without complaints in batch, https://github.com/hyperledger/fabric/pull/1798

yingfeng
2016-06-12 01:40
I am stress testing with the code after PR 1774 was merged, on a 4-node setup configured with `plugin:pbft` and `mode:classic`. The jmeter clients show a relatively high TPS: ``` Waiting for possible shutdown message on port 4445 summary + 25811 in 9s = 2876.5/s Avg: 488 Min: 3 Max: 738 Err: 0 (0.00%) Active: 2000 Started: 2000 Finished: 0 ``` However, after I stopped jmeter the system kept running for a long time (jmeter ran for only a few seconds, but the system took around 5 minutes to quiesce). It seems requests are buffered heavily and only then handed to consensus, so the practical upper bound on TPS is far lower than the number shown above. If I change the config from `mode:classic` to `mode:batch`, the TPS reported by jmeter is even higher, say 7000 TPS, but I have to wait even longer for the system to quiesce after stopping jmeter. So, what TPS can fabric achieve in practice? Processing is effectively asynchronous, so fabric does not tell clients when it is beyond its processing limit; a client can never know whether its request will actually succeed (even if requests are discarded because an internal buffer is full, the client is never told).

yingfeng
2016-06-12 07:45
Another issue from the above stress test: after I set up 4 peer nodes (peer0-peer3) and used jmeter to send requests to one peer, the logs of two nodes kept growing without stopping, even though the stress test lasted only a few seconds. The logs of peer0 and peer2 have kept increasing for hours and have reached tens of gigabytes; peer2 is the node accepting the jmeter requests. Here are some log snippets from peer0 and peer2:
```
07:41:42.648 [peer] beforeSyncBlocks -> WARN 1d53762 Ignoring SyncBlocks message with correlationId = 4982701, blocks 4184 to 4184, as current correlationId = 4982702
(the same warning repeats for blocks 4183 down to 4178)
07:41:42.649 [consensus/statetransfer] tryOverPeers -> WARN 1d53769 name:"vp1" in tryOverPeers loop trying name:"vp0" : name:"vp1" got block 4188 from name:"vp0" with hash 1458528567ed10981616468b50bc1754416e4388a871a848431b5e4bcf7e0470a5aaee4c978fc57d59db3c3e5adf8a407444f3210f0f304740ea2984ebcdf3f9, was expecting hash ffc3496f8d3cec47fa664a848dff85a4b05f0de8d2dd76594d920680a831faa45af0d955ca2892462623d78883fcbfca05346add57a7cff84f7492238f5d705d
(the same mismatch is then reported when trying vp2 and vp3)
```

jyellick
2016-06-12 17:14
@yingfeng Is this pbft batch? What these messages indicate is that vp0, vp2, and vp3 all agree on the hash `1458528567ed10981616468b50bc1754416e4388a871a848431b5e4bcf7e0470a5aaee4c978fc57d59db3c3e5adf8a407444f3210f0f304740ea2984ebcdf3f9` for block 4188, but for some reason vp1 believes the hash to be `ffc3496f8d3cec47fa664a848dff85a4b05f0de8d2dd76594d920680a831faa45af0d955ca2892462623d78883fcbfca05346add57a7cff84f7492238f5d705d`. This causes the peer to retry retrieving that block over and over, constantly failing, which produces the error flood you see in the logs. To figure out why vp1 believes the wrong hash, I would need to see logs from earlier on.

jyellick
2016-06-12 17:16
Also, for stress testing of pbft batch, I highly suggest you include PR 1798, as this fixes some known bugs which are related to stress.

yingfeng
2016-06-13 00:42
@jyellick: it's pbft classic

tuand
2016-06-13 01:07
@yingfeng: could you give us debug logs ? Or show us how we can set up a client to reproduce the test you ran? And can you create an issue for this?

yingfeng
2016-06-13 01:10
@tuand: the logs were discarded since the disk was full... I will create an issue and attach the jmeter file

jyellick
2016-06-13 01:25
@yingfeng: pbft classic is pending deprecation, please use batch with a batch size of 1 to emulate classic
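
[editor's note] A hedged sketch of what the corresponding consensus config might look like, based only on the keys mentioned in this thread (`plugin`, `mode`); the exact key names and file layout in `consensus/obcpbft/config.yaml` may differ, so verify against your tree:

```yaml
# Illustrative only; check consensus/obcpbft/config.yaml for the real keys.
plugin: pbft
mode: batch    # "classic" is pending deprecation
batchsize: 1   # a batch size of 1 emulates classic ordering
```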

yingfeng
2016-06-13 01:25
@jyellick: got it, thanks~

yingfeng
2016-06-13 04:26
@jyellick: I applied PR 1798, set the config to pbft `batch` with a batch size of `2`, and re-ran the stress test; the above issue no longer appears. I used jmeter to send around 290K invoke requests to the chaincode through the REST API in around 30 seconds. Here is the jmeter output:
```
root@75df16cca62c:/# jmeter -n -t fabric.jmx
Creating summariser <summary>
Created the tree successfully using fabric.jmx
Starting the test @ Mon Jun 13 03:49:54 UTC 2016 (1465789794744)
Waiting for possible shutdown message on port 4445
summary +  37166 in     5s = 7869.2/s Avg:   107 Min:     0 Max:   320 Err:     0 (0.00%) Active: 2000 Started: 2000 Finished: 0
summary + 260766 in  30.2s = 8633.8/s Avg:   230 Min:   179 Max:   443 Err:     0 (0.00%) Active: 2000 Started: 2000 Finished: 0
summary = 297932 in    35s = 8580.2/s Avg:   214 Min:     0 Max:   443 Err:     0 (0.00%)
```
Around 8000 TPS. However, the queued requests took half an hour to be processed. Of peer0-peer3, peer0 is the `CORE_PEER_DISCOVERY_ROOTNODE` and peer3 is the node accepting jmeter requests; the logs of those two nodes have reached 3GB even at log level INFO, while the logs of the other two nodes are only around 40MB.

yingfeng
2016-06-13 04:48
The average TPS is around 30 per node: although 290K requests were sent to a single node, only 22K were successfully processed (the rest were discarded because the buffer was full), and it took 10 minutes for those 22K requests to be processed.

yingfeng
2016-06-13 05:03
When I adjusted the batch size of pbft from 2 to 1000, the above performance metrics did not change much.

simon
2016-06-13 10:41
no

jyellick
2016-06-13 12:57
@yingfeng: Remember that Invoke is an asynchronous call, submitting 8000 TPS for a half hour may queue all requests, but if the underlying chaincodes are only able to be executed at 30TPS, then no amount of batching etc. will help. It sounds like you have hit a bottleneck which is not related to consensus, but rather chaincode execution.

simon
2016-06-13 13:04
we really need to make this closed loop

simon
2016-06-13 13:05
but i don't know how

jyellick
2016-06-13 13:30
I don't think you'll find support for that from the distributed-systems folks

jyellick
2016-06-13 13:31
Really, the simpler solution to me would be to start rejecting requests once our queue is X full

simon
2016-06-13 13:32
but what queue

jyellick
2016-06-13 13:32
Well, we need a queue

simon
2016-06-13 13:32
:slightly_smiling_face:

simon
2016-06-13 13:32
but we forward requests to the primary

jyellick
2016-06-13 13:32
We do, so hold them in the queue until they are executed?

simon
2016-06-13 13:32
okay

simon
2016-06-13 13:32
how large is the queue then?

jyellick
2016-06-13 13:33
I'd say configurable, but, a few thousand seems like a reasonable first guess
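
[editor's note] The "reject once the queue is X full" idea can be sketched with a bounded channel that refuses instead of blocking, giving the client immediate back-pressure. Everything here is illustrative; none of these names exist in the fabric code base.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrQueueFull gives the client an immediate rejection instead of
// silently dropping the request later.
var ErrQueueFull = errors.New("request queue full")

// requestQueue is a hypothetical bounded buffer for client requests.
type requestQueue struct {
	ch chan []byte
}

func newRequestQueue(capacity int) *requestQueue {
	return &requestQueue{ch: make(chan []byte, capacity)}
}

// enqueue rejects rather than blocks once the configured capacity is hit.
func (q *requestQueue) enqueue(req []byte) error {
	select {
	case q.ch <- req:
		return nil
	default:
		return ErrQueueFull
	}
}

func main() {
	q := newRequestQueue(2)
	fmt.Println(q.enqueue([]byte("tx1"))) // <nil>
	fmt.Println(q.enqueue([]byte("tx2"))) // <nil>
	fmt.Println(q.enqueue([]byte("tx3"))) // request queue full
}
```

A consumer goroutine would drain `q.ch` as requests are forwarded to the primary and executed; the capacity (here 2, in practice "a few thousand") would come from config.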

simon
2016-06-13 13:33
:slightly_smiling_face:

simon
2016-06-13 13:33
hey slack, stop replacing my inverse smileys

simon
2016-06-13 13:42
jyellick: do you know why batch main loop keeps looping without printing anything else?

jyellick
2016-06-13 13:44
Nothing else at all? Is this a unit test or live?

simon
2016-06-13 13:44
test

simon
2016-06-13 13:45
TestNetworkBatch is racy

simon
2016-06-13 13:45
should i just add a timeout?

simon
2016-06-13 13:45
well, a sleep

jyellick
2016-06-13 13:48
Which piece of it is racy?

jyellick
2016-06-13 13:48
In unit tests, we send `nil` events to basically flush the event thread (make sure it has finished processing the last event we gave it)
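
[editor's note] The nil-event flush relies on FIFO channel ordering plus serial processing: once a nil sent after event E has been received, E has necessarily been handled. A minimal sketch of the pattern (illustrative names, not the actual fabric events package):

```go
package main

import "fmt"

// manager runs a single event-processing goroutine fed by an unbuffered
// channel, so events are handled serially in submission order.
type manager struct {
	queue chan interface{}
}

func newManager(handle func(interface{})) *manager {
	m := &manager{queue: make(chan interface{})}
	go func() {
		for e := range m.queue {
			if e != nil {
				handle(e)
			}
		}
	}()
	return m
}

// inject queues an event for the processing goroutine.
func (m *manager) inject(e interface{}) { m.queue <- e }

// flush sends a nil event; because the channel is unbuffered and events
// are handled serially, this send only returns once every earlier event
// has been fully processed.
func (m *manager) flush() { m.queue <- nil }

func main() {
	var handled []int
	m := newManager(func(e interface{}) { handled = append(handled, e.(int)) })
	m.inject(1)
	m.inject(2)
	m.flush()
	fmt.Println(handled) // [1 2]
}
```

The channel send/receive pairs also establish happens-before edges, so reading `handled` after `flush` returns is race-free.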

simon
2016-06-13 13:51
so what happens is this:

simon
2016-06-13 13:51
```
13:28:09.132 [consensus/obcpbft/custodian] Register -> DEBU 06b Registering EPeABAdNkwfyAX8LB2i+fuLedvzVDNDhcYscmFYymcBuCc8i3ngXIqmnVVUOFv3Dr+pA1ZE5NPzGeUBjwkKtig== into custody with timeout 2016-06-13 13:28:11.132112257 +0200 CEST
13:28:09.132 [consensus/obcpbft] processMessage -> INFO 06c Batch replica 2 received new consensus request: EPeABAdNkwfyAX8LB2i+fuLedvzVDNDhcYscmFYymcBuCc8i3ngXIqmnVVUOFv3Dr+pA1ZE5NPzGeUBjwkKtig==
--- FAIL: TestNetworkBatch (0.10s)
    obc-batch_test.go:65: 0 messages expected in primary's batchStore, found [timestamp:<seconds:1465817289 nanos:31448813 > payload:"\010\001\032\00112\002\010\001" replica_id:1 ]
```

simon
2016-06-13 13:52
that's this:

simon
2016-06-13 13:52
```
err = net.endpoints[2].(*consumerEndpoint).consumer.RecvMsg(createOcMsgWithChainTx(2), broadcaster)
net.process()
```

simon
2016-06-13 13:53
the process somehow doesn't last long enough for the message to be sent to the primary, which then would create a new batch block

jyellick
2016-06-13 13:55
Hmmm, that's a single threaded path I think? I don't see how it could process that nil event without having already queued a message

jyellick
2016-06-13 13:55
Is it possible that you hit the batch timeout on the first request?

jyellick
2016-06-13 13:55
(So that the second request didn't fill the batch size)

simon
2016-06-13 14:07
nope

simon
2016-06-13 14:10
ah no, interesting. replacing the fatalf with errorf, it turns out that the batch replica never receives the message?

simon
2016-06-13 14:10
hm.

simon
2016-06-13 14:11
ah nm, i need to process more

simon
2016-06-13 14:28
jyellick: so if i call process() twice, it works :slightly_smiling_face:

simon
2016-06-13 14:30
haha

simon
2016-06-13 14:30
what a hack

jyellick
2016-06-13 14:30
Hmmm, it should be harmless to call it 'extra', but that's odd. We still need to just fix the test framework to be entirely deterministic

simon
2016-06-13 14:30
yes

jyellick
2016-06-13 14:31
Obviously lower priority than fixing actual bugs in the real code, but hopefully we can find some time to do this after this June release

jyellick
2016-06-13 14:31
Or maybe during the period of code stabilization, merging new tests should be safe

simon
2016-06-13 14:32
yes

jyellick
2016-06-13 17:17
@simon: Are you around?

simon
2016-06-13 17:40
i am

simon
2016-06-13 19:00
seems that scheduling latency for me is in the order of 32us

jyellick
2016-06-13 19:01
@simon What do you think should be the procedure when we get a view change, and have all the commit certs we need to reach a checkpoint, but do not have it yet?

jyellick
2016-06-13 19:01
(Whenever I try to turn on the periodic view change stuff, invariably, it triggers a ton of state transfer as one of the replicas is necessarily the slowest, and gets told to change views before it can reach a checkpoint)

simon
2016-06-13 19:14
we should execute instead of state transfer

jyellick
2016-06-13 19:22
My concern is that this is off paper, and we might not have enough room in our execution window say for everything in the Xset

grapebaba
2016-06-14 04:45
has joined #fabric-consensus-dev

zuowang
2016-06-14 11:51
has joined #fabric-consensus-dev

jyellick
2016-06-14 12:11
@simon: @kostas @vukolic What do you think of the correctness of replying with a VIEW-CHANGE immediately if it is the primary of the view who sends it?

simon
2016-06-14 12:12
what does this address?

simon
2016-06-14 12:12
i guess that would be correct

kostas
2016-06-14 12:14
I'm also wondering what this addresses

jyellick
2016-06-14 12:16
It just helps with the liveness of the network: if the primary has sent a view change, then it is either in a new view or byzantine, and we should move on

jyellick
2016-06-14 12:17
In particular, the primary times out waiting for a reply to its pre-prepare, and switches views

jyellick
2016-06-14 12:18
The rest of the network prepares/commits that request, and then thinks the world is good

jyellick
2016-06-14 12:19
Definitely an optimization and not a correctness thing. Just see this frequently in the busywork tests

thiruworkspace
2016-06-14 12:27
has joined #fabric-consensus-dev

jyellick
2016-06-14 12:33
FYI all, I've got that class today unfortunately, so I'll have very limited availability throughout the day

cca
2016-06-14 12:35
@jyellick: does this VIEW-CHANGE optimization take place in the context of the PBFT algorithm as in the TOCS paper (p. 411)? If yes, then I would note that the leader doesn't ever send a VIEW-CHANGE there, neither in the figure nor in the text. It just remains correct and satisfied by itself.

jyellick
2016-06-14 14:20
@cca: Then our code is wrong, as the leader will send view changes based on request timers, or its own failure to generate a new view

jyellick
2016-06-14 14:26
@simon: @kostas I can fix the above issue, but likely won't be able to get to it until tomorrow, if one of you has a chance and chooses to submit a PR, please make sure you base it off of https://github.com/hyperledger/fabric/pull/1798

simon
2016-06-14 14:40
jyellick: you mean add view change when the primary does?

cca
2016-06-14 15:33
jyellick: i would not say "wrong"; there seems to be no harm except for unnecessary view changes, though it's not exactly like the paper and perhaps unnecessarily cautious. certainly, the suggested fix shouldn't happen, because the problem is better dealt with by eliminating the source.

jyellick
2016-06-14 16:06
@cca @simon I meant fix by preventing the leader from sending timeout-based view change messages. @cca Are there exceptions to this? Should the primary still respond with a VIEW-CHANGE when it receives f+1 VIEW-CHANGE messages (I would think so)?

jyellick
2016-06-14 16:07
Maybe we should hold off on this until after this June freeze, the code as written seems to be working, and unless it fixes a critical bug, it might not be worth the potential regressions.

simon
2016-06-14 16:07
i don't understand what you mean by timeout based view change messages

simon
2016-06-14 16:07
are you saying that the primary shouldn't maintain a view change timer?

jyellick
2016-06-14 16:07
For instance, request timeouts. The primary sends a pre-prepare, and if it is not committed within the timeout window, then it sends a view change.

jyellick
2016-06-14 16:07
Per @cca it sounds like the primary should _not_ send under this scenario.

jyellick
2016-06-14 16:08
(I would also assume the primary should not send a VIEW-CHANGE in response to a new view timeout, I would check the paper, but only snuck off to do real work during a lunch break)

simon
2016-06-14 16:11
why shouldn't it send a view change?

jyellick
2016-06-14 16:14
Well, I assume because we know that we are not byzantine, and we may have started the view change timer sooner than the backups. Also, because it is apparently specified as such in the Castro paper, and if it is causing problems to deviate from it, we should not.

cca
2016-06-14 16:18
As primary, i will operate under the assumption that i am able to do the job, and not give up voluntarily; it is left to the others to kick me out. if my request does not get through (the primary has sent the pre-prepare but the request does not commit), then i can't do anything, because by assumption something like n-f nodes are reachable; if i cannot reach them, they should kick me out by triggering a view change.

cca
2016-06-14 16:19
When I get f+1 VIEW-CHANGE msgs from others, then I would chime in as leader, yes, as I cannot prevent my expulsion any more.
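
[editor's note] The rule cca describes reduces to a small guard: a backup sends VIEW-CHANGE on its own timeout, while the primary of the current view only joins once f+1 others have already complained. A sketch with hypothetical names (the real pbft-core logic is structured differently):

```go
package main

import "fmt"

// shouldSendViewChange decides whether replica id should emit a
// VIEW-CHANGE for the current view of an n-replica network tolerating
// f byzantine faults, given how many VIEW-CHANGE messages it has seen
// from others and whether its local timer fired.
func shouldSendViewChange(id, view, n, f, complaints int, timedOut bool) bool {
	isPrimary := id == view%n
	if isPrimary {
		// The primary never volunteers on a local timer; it only
		// concedes once its expulsion is inevitable.
		return complaints >= f+1
	}
	return timedOut || complaints >= f+1
}

func main() {
	// n=4, f=1, view 0: replica 0 is primary.
	fmt.Println(shouldSendViewChange(0, 0, 4, 1, 0, true))  // primary timer fires: false
	fmt.Println(shouldSendViewChange(0, 0, 4, 1, 2, false)) // primary sees f+1 complaints: true
	fmt.Println(shouldSendViewChange(1, 0, 4, 1, 0, true))  // backup timer fires: true
}
```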

jyellick
2016-06-14 16:20
@cca Since you here, so another quick question to you and @simon . On view change, we would like to not perform state transfer if we have all the needed requests less than the initial checkpoint selected by the new view. The only way we know that a particular request was committed is if we have a commit certificate in the previous view, we cannot deduce this from the pSets/qSets in general?

cca
2016-06-14 16:20
sure!

cca
2016-06-14 16:21
are your psets/qsets exactly like in the TOCS paper?

jyellick
2016-06-14 16:22
That is my understanding, @kostas I believe is primarily responsible for that code.

kostas
2016-06-14 16:22
that would be @simon actually

kostas
2016-06-14 16:22
and the answer to @cca is yes

jyellick
2016-06-14 16:22
(The problem is that on state transfer we must necessarily clear some state, like outstanding requests, as we cannot tell whether they were included in one of the transferred blocks [or older, still-transferring blocks], so it can lead to us needlessly orphaning requests. Also, state transfer is generally slower than execution. So if we have the knowledge to get to the current state without state transfer, that is preferable.)

simon
2016-06-14 16:25
it is slower?

cca
2016-06-14 16:25
aha - but i dont understand this one yet - state transfer would mean on the ledger level, the KVS and everything? or some history of committed (decided) requests?

simon
2016-06-14 16:25
sync the ledger

cca
2016-06-14 16:26
the ledger is state that you want to transfer?

jyellick
2016-06-14 16:26
Yes, so, the particularly nasty scenario is as follows:

jyellick
2016-06-14 16:28
We are at block 5, the network is at block 1,000,000. We end up doing state transfer, and, because it would take us hours to get all million blocks, we only have say, blocks 0-5, and 999,999,990-1,000,000, and a copy of the current state. Ignoring the possibility of missing chaincode, there is nothing which prevents us from executing transactions and writing new blocks.

jyellick
2016-06-14 16:28
But, unless we have all million blocks, the ledger can't tell us which transactions have committed, it is part of the chain, not part of the state.

simon
2016-06-14 16:29
but why do we need to know which transactions have committed?

jyellick
2016-06-14 16:30
Ah, because we collect requests as they come in, and remove them from the outstanding list as they execute.

cca
2016-06-14 16:30
[ P = set of requests that have prepared according to my knowledge, in previous views; Q = set of requests that have pre-prepared; even from all those sets sent in all VIEW-CHANGE msgs that I receive, I cannot infer which ones have also committed... ]

simon
2016-06-14 16:30
right, so we need an R set

simon
2016-06-14 16:30
which is local use only

simon
2016-06-14 16:31
which is the set of requests we sent to the executor

jyellick
2016-06-14 16:31
On view change, almost always one replica ends up doing state transfer, and must discard all its outstanding requests.

simon
2016-06-14 16:31
yes, it must

jyellick
2016-06-14 16:31
When we have periodic view changes and large numbers of outstanding requests, every replica ends up (after a few iterations) discarding its outstanding requests, and we end up orphaning some.

cca
2016-06-14 16:32
If almost always a replica does state transfer on view change, you suggest that it could avoid this by receiving the committed requests that it missed, and apply those?

jyellick
2016-06-14 16:32
Right

jyellick
2016-06-14 16:32
In many cases, I would expect that it already has commit certs for these

cca
2016-06-14 16:33
well, if it has commit certs, then it can just run through them, not?

cca
2016-06-14 16:33
or is something else missing?

jyellick
2016-06-14 16:34
Yes, that was my first thought, my only concern is that view change calls for moving the watermarks to the h specified by the view change

jyellick
2016-06-14 16:34
The view change is obviously a complicated procedure, I'm just wary I'm missing something here.

simon
2016-06-14 16:35
but you could just delay processing the view change message (in/out)

simon
2016-06-14 16:35
and execute

cca
2016-06-14 16:35
yes, but i would not accept and perform the view change yet. if i peek at it and see that I can get there without state transfer, then I could just pretend the view-change has not yet arrived and work through my requests until I get there as well. once there, when I start doing the view-change, i will already have the correct state

cca
2016-06-14 16:35
(my suggestion == simon's)
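
[editor's note] The check simon and cca propose can be sketched as: before acting on a view change whose initial checkpoint h is ahead of us, verify we hold commit certificates for every missing sequence number, and only then execute locally instead of state transferring. Hypothetical names, not the actual fabric code.

```go
package main

import "fmt"

// canCatchUpLocally reports whether every sequence number between our
// last executed request and the new view's initial checkpoint h is
// covered by a commit certificate we already hold; if so, we can
// execute forward instead of doing state transfer.
func canCatchUpLocally(lastExec, h uint64, hasCommitCert func(uint64) bool) bool {
	for seq := lastExec + 1; seq <= h; seq++ {
		if !hasCommitCert(seq) {
			return false // a gap forces state transfer after all
		}
	}
	return true
}

func main() {
	certs := map[uint64]bool{4: true, 5: true, 6: true}
	has := func(seq uint64) bool { return certs[seq] }
	fmt.Println(canCatchUpLocally(3, 6, has)) // 4..6 all present: true
	delete(certs, 5)
	fmt.Println(canCatchUpLocally(3, 6, has)) // 5 missing: false
}
```

In the scheme above, a replica for which this returns true would defer processing the view-change message until its execution catches up to h.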

simon
2016-06-14 16:35
:slightly_smiling_face:

jyellick
2016-06-14 16:36
Thanks, looks like they are telling me to get off my laptop, I'll think about it more, but that seems like a reasonable solution.

simon
2016-06-14 16:36
oppressive!

cca
2016-06-14 16:36
sure... but flow control remains an important topic for the protocol

cca
2016-06-14 16:36
this would just be an optimization that does not change it semantics

simon
2016-06-14 16:37
yea flow control :confused:

cca
2016-06-14 16:37
with more experience we will probably need to deal with the "lagging" peer in a different way... like, when it still sends I-am-alive signals, delaying the fast ones a bit, so that the lagging one can catch up

simon
2016-06-14 16:37
we're missing flow control for requests forwarded to primary

simon
2016-06-14 16:38
in other news, i can now prove what the problem is with poor performance

simon
2016-06-14 16:38
and it is exactly what i thought it was:

simon
2016-06-14 16:38
goroutine scheduling issues

cca
2016-06-14 16:38
aha, good to know.

cca
2016-06-14 16:38
make sure it becomes known beyond this channel

simon
2016-06-14 16:38
yep, i will

simon
2016-06-14 16:39
i don't think we will be able to work around this easily

tuand
2016-06-14 16:39
# channel

nits7sid
2016-06-15 13:27
i am running 4 peers using docker under classic pbft, but i am getting a view change as soon as I deploy a chaincode: `13:25:15.453 [consensus/obcpbft/events] loop -> WARN 437 Attempting to stop an unfired idle timer`

tuand
2016-06-15 13:31
I'm guessing that the deploy transaction data structure is very big and is taking too long to broadcast which in turn is causing the request timeout timer to fire

tuand
2016-06-15 13:33
couple things you can try ... increase the request timeout value in obcpbft/config.yaml or check your hyperledger/fabric source tree and see if there are old copies of logs and what not that can be deleted ( those files are getting pulled into the deploy transaction)

tuand
2016-06-15 13:34
if you're then still seeing a problem, can you create an issue and include the debug logs ?

nits7sid
2016-06-15 13:34
which logs ?

tuand
2016-06-15 13:35
set core_logging_level=debug

nits7sid
2016-06-15 13:35
ohh okay

nits7sid
2016-06-15 13:35
ill try that

nits7sid
2016-06-15 13:35
thanks:-)

jyellick
2016-06-15 14:28
That warning message is benign, and has been dropped in severity after https://github.com/hyperledger/fabric/pull/1777

simon
2016-06-15 15:34
jyellick: can state transfer revert existing state?

simon
2016-06-15 15:34
i thought it does not, kostas thinks it does

jyellick
2016-06-15 15:34
It can, it is configurable

simon
2016-06-15 15:35
ah so we do have an option to revert

jyellick
2016-06-15 15:36
It can overwrite existing blocks, there is no mechanism to delete existing blocks unfortunately

simon
2016-06-15 15:37
what about state?

simon
2016-06-15 15:37
will it apply deltas in reverse?

jyellick
2016-06-15 15:40
It will not

jyellick
2016-06-15 15:40
State can be recovered completely, and played forward

jyellick
2016-06-15 15:41
The ledger has the ability to apply deltas backwards

jyellick
2016-06-15 15:41
But state transfer does not utilize this. (In particular, because we want to try to verify the state snapshot before applying the deltas, but at that point, there is no point in going back in time)

simon
2016-06-15 15:47
so currently when we figure out that we diverged from the majority of the network, we will start state transfer, and state transfer will fix all data?

jyellick
2016-06-15 15:49
Correct

jyellick
2016-06-15 15:49
Unless you have twiddled the config to panic in this scenario

jyellick
2016-06-15 15:49
(But this is not the default)

jyellick
2016-06-15 15:51
Rough flow is:
1. We are given a state target which is different from what we have.
2. Go fetch the blocks from that target back to our current believed-valid block.
3. Realize our state doesn't mesh with what's reported by the retrieved blocks, and determine how much of the chain does not hash.
4. A new copy of the state is grabbed; corrupt blocks are grabbed and written.
5. We're now in a good state, and normal state transfer for recovery takes place.

scottz
2016-06-15 18:43
has joined #fabric-consensus-dev

scottz
2016-06-15 19:01
@simon @jyellick Hi, Do you have line of sight yet, and can you provide a forecast for merging the pull request 1793 for issue 1056? Sharon and Barry and many others of us are eagerly anticipating your delivery, so we can run our regression and performance tests on a loadbuild that includes this (along with 1798).

jyellick
2016-06-15 19:04
@scottz As this fixes a critical issue, I'm okay with 1793 as is, the theoretical negative implications are still better than the current bad behavior, and we can address those in the future

jyellick
2016-06-15 19:05
I'll post to that effect in the PR

jyellick
2016-06-15 19:06
Looks like @simon will still need to rebase

scottz
2016-06-15 19:35
@simon @jyellick Thanks for the update. Then it sounds like we could get this fix by early tomorrow, if all goes well.

cbf
2016-06-15 20:18
@scottz: you do know you can cherry-pick a pr and run tests... no need to wait

cbf
2016-06-15 20:19
it would help reinforce that the pr is good

harshal
2016-06-16 16:23
@harshal has left the channel

yingfeng
2016-06-17 10:35
Does fabric have sequential semantics? Say there is a series of messages A, B, C, D with increasing timestamps: does fabric guarantee the execution order of transactions, such that a transaction with a larger timestamp will never be executed before one with a smaller timestamp?

simon
2016-06-17 10:39
no

yingfeng
2016-06-17 10:42
so A,B,C,D will be concurrently executed without order guaranteed ?

simon
2016-06-17 10:45
if you submit them concurrently, they will be executed in a random order

yingfeng
2016-06-17 10:56
got it, thanks~

simon
2016-06-17 13:10
jyellick: you around?

jyellick
2016-06-17 13:10
Yep

jyellick
2016-06-17 13:10
Working on that fix to #1874

simon
2016-06-17 13:10
but what is the problem?

simon
2016-06-17 13:11
i've been trying to replicate problems all morning and didn't get anywhere

simon
2016-06-17 13:11
or rather, found that you fixed it already

jyellick
2016-06-17 13:13
The big one, is that after view change, you can get duplicate executions, if the view changes after a request makes it into the pset, then gets scheduled for resubmission before the primary executes it

jyellick
2016-06-17 13:13
All my stress testing for view changes with this was at checkpoint boundaries, so my psets were generally empty

simon
2016-06-17 13:14
how does that lead to freeze?

jyellick
2016-06-17 13:15
Ah, it doesn't! But, @tuand's behave test failed because the result was 'wrong'

jyellick
2016-06-17 13:15
So, no freeze, but potentially multiply executing transactions, which is is the problem I'm fixing

simon
2016-06-17 13:16
ah

simon
2016-06-17 13:16
so that's in the executor?

simon
2016-06-17 13:16
but don't we advance lastexec?

jyellick
2016-06-17 13:19
No, it's not in executor, it's a view change logic bug

jyellick
2016-06-17 13:20
We need to pull the requests out of the pset/qset which are in the new view, but which the new primary didn't initially order, and make sure we do not submit them to the network as outstanding requests

jyellick
2016-06-17 13:20
Otherwise we end up executing the same request in two different batches
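
[editor's note] The fix being described amounts to pruning the outstanding-request set on view change: any request the new primary has already ordered must not be resubmitted, or it will execute again in a second batch. A sketch with illustrative names only:

```go
package main

import "fmt"

// pruneOutstanding removes from the outstanding set every request
// digest that the new view has already ordered (e.g. requests carried
// over via the pset/qset into the new-view message), so they are not
// resubmitted to the network and executed twice.
func pruneOutstanding(outstanding map[string][]byte, ordered []string) {
	for _, digest := range ordered {
		delete(outstanding, digest)
	}
}

func main() {
	outstanding := map[string][]byte{
		"d1": []byte("tx1"),
		"d2": []byte("tx2"),
	}
	// Suppose the new view already ordered the request with digest d1.
	pruneOutstanding(outstanding, []string{"d1"})
	fmt.Println(len(outstanding)) // 1: only d2 remains to be resubmitted
}
```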

simon
2016-06-17 13:20
oh we're talking about an issue in batch

simon
2016-06-17 13:20
not the core

jyellick
2016-06-17 13:21
On the second execute, we'll see that somehow, even as the primary, we didn't know about that request (we deleted it from our store), but, so as not to fork, we execute anyway

jyellick
2016-06-17 13:21
Right

simon
2016-06-17 13:21
ha!

simon
2016-06-17 13:21
well, as we said, somebody down the line should prevent replays anyways

simon
2016-06-17 13:22
was that problem also in my complainer/deduplicator?

simon
2016-06-17 13:22
i guess my code filters more aggressively

jyellick
2016-06-17 13:28
I think it was

jyellick
2016-06-17 13:28
Or rather, I should say, you could certainly multiply submit requests to the network on view change

jyellick
2016-06-17 13:28
But, on execute, some or all of the nodes might filter it out

simon
2016-06-17 13:31
right

tuand
2016-06-17 13:43
did you guys see the question about `TestOutstandingReqsSubmission` failing in PR #1877 ?

jyellick
2016-06-17 13:50
Not yet, let me take a look

jyellick
2016-06-17 14:25
I'm planning to push the outstanding req fix to #1877 so will fix it if it is still failing then

simon
2016-06-17 14:25
jyellick: can you rebase it so that only new commits are in the PR?

jyellick
2016-06-17 14:26
I tried to do that... any handy commands I should know?

jyellick
2016-06-17 14:26
(Since my commits were squashed in that other PR, even when rebasing to master, all those other commits linger)

simon
2016-06-17 14:27
git rebase --onto upstream/master aa69ef

simon
2016-06-17 14:27
hm

simon
2016-06-17 14:27
or one below?

simon
2016-06-17 14:28
8b54?

simon
2016-06-17 14:28
i have a handy UI in emacs for that

jyellick
2016-06-17 14:39
Thanks, I'll give it a shot

jyellick
2016-06-17 14:58
@tuand: Are you planning on submitting a PR for those behave test, or should I include them in the PR I'm submitting?

tuand
2016-06-17 14:59
simon has a pr for the #1874 behave test ... go ahead and add the #1873 one to your pr

jyellick
2016-06-17 15:01
Thanks, will do

jyellick
2016-06-17 15:10
Have code changes which fix the behave test, need to write some unit tests, then will rebase and submit

tuand
2016-06-17 15:16
@jyellick: code changes for #1874 ? if so, check if .behaverc is skipping @issue_1874 and remove. I just sent pr #1898 because all the builds are failing

jyellick
2016-06-17 15:16
Yes, for #1874

jyellick
2016-06-17 15:17
Thought you said I should include that behave test in my PR, it's already there?

tuand
2016-06-17 15:17
include the behave test for #1873

jyellick
2016-06-17 15:38
Ah, got it

sheehan
2016-06-17 20:07
Tests have started to fail with
```
--- FAIL: TestSieveNoDecision (7.01s)
    obc-sieve_test.go:139: replica 0 in epoch 2, expected 1
    obc-sieve_test.go:139: replica 1 in epoch 2, expected 1
    obc-sieve_test.go:139: replica 2 in epoch 2, expected 1
    obc-sieve_test.go:139: replica 3 in epoch 2, expected 1
```
Is this a known issue? Seems it started on an unrelated change

jyellick
2016-06-17 20:09
I've seen this, I'll add a skip to it

jyellick
2016-06-17 22:29

jyellick
2016-06-17 22:31
@tuand: @kostas @simon Added some commits to https://github.com/hyperledger/fabric/pull/1877 unfortunately it spiked the complexity a little, but I've spent the afternoon testing, passing it through busywork, etc., so hopefully it is pretty stable. Would like to write some more specific unit tests for the `requestStore`, but ran out of time and wanted to at least put it out there for review

scottz
2016-06-19 07:20
It seems 1877 fixed 1873 but not 1874. https://github.com/hyperledger/fabric/issues/1874

c0rwin
2016-06-19 10:31
has joined #fabric-consensus-dev

scottz
2016-06-19 16:36
@scottz uploaded a file: https://hyperledgerproject.slack.com/files/scottz/F1J755SRH/s1s2ir2iq.go and commented: I updated 1874. I cannot yet explain why it fails, even though the behave test passes, and they seem to perform the same steps. I even modified my testcase to avoid extra things and avoid stopping peer0.

yingfeng
2016-06-20 06:19
It seems the behavior of the latest version under stress testing has changed. Previously, I set up 4 nodes with `pbft` and `batch`; after deploying the `example2` chaincode, I used jmeter to send `invoke` requests to certain nodes via the REST API. jmeter could reach nearly 10K TPS, while fabric processed transactions at around 20 per second, so the remaining requests were discarded. With the latest codebase, jmeter only shows tens of requests per second, and at the same time any other REST clients are blocked. Has the behavior changed from `asynchronous` to `near synchronous`? Additionally, there seem to be bugs in the latest codebase, because I can see only a few `invoke` transactions have taken effect, while later requests have no effect on the chaincode—I drew this conclusion by continuously sending `query` requests to a peer node.

simon
2016-06-20 06:58
yingfeng: what do you mean, several times and no effect?

yingfeng
2016-06-20 06:59
yes, with the latest commit only a few `invoke` requests have taken effect

yingfeng
2016-06-20 07:00
because the results from `query` requests remain unchanged for a long time, although jmeter had been running for minutes

simon
2016-06-20 07:03
do you have logs?

simon
2016-06-20 07:03
debug logs

simon
2016-06-20 13:57
jyellick: i just replaced the O(n) + reflect request store with something a bit faster

simon
2016-06-20 13:57
but i think we should use a better data structure

simon
2016-06-20 13:57
sorting a slice is just plain awful

simon
2016-06-20 13:58
with 100 entries, it takes avg 2ms to do an add, test, remove

simon
2016-06-20 13:58
well, for 100 entries

simon
2016-06-20 13:58
for 1000 entries it takes 215ms

simon
2016-06-20 13:58
this sorting is expensive
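(An editorial sketch of the structure discussed below: an insertion-ordered request store backed by `container/list` plus a digest index gives O(1) add, lookup, and removal with no sorting. Names like `requestStore` and the string-digest key are illustrative, not the fabric's actual API.)

```go
// Sketch of an insertion-ordered request store: arrival order is kept by a
// doubly linked list, while a map from digest to list element makes add,
// membership test, and remove all O(1). No sort pass is ever needed.
package main

import (
	"container/list"
	"fmt"
)

type requestStore struct {
	order *list.List               // requests in arrival order
	index map[string]*list.Element // digest -> list element, for O(1) ops
}

func newRequestStore() *requestStore {
	return &requestStore{order: list.New(), index: make(map[string]*list.Element)}
}

// add appends the request at the back, preserving arrival order.
func (s *requestStore) add(digest string) {
	if _, ok := s.index[digest]; ok {
		return // already stored
	}
	s.index[digest] = s.order.PushBack(digest)
}

func (s *requestStore) has(digest string) bool {
	_, ok := s.index[digest]
	return ok
}

// remove deletes by digest without scanning or re-sorting the store.
func (s *requestStore) remove(digest string) {
	if e, ok := s.index[digest]; ok {
		s.order.Remove(e)
		delete(s.index, digest)
	}
}

// oldest returns the digest at the front, or "" when empty.
func (s *requestStore) oldest() string {
	if e := s.order.Front(); e != nil {
		return e.Value.(string)
	}
	return ""
}

func main() {
	s := newRequestStore()
	s.add("a")
	s.add("b")
	s.remove("a")
	fmt.Println(s.oldest()) // b
}
```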

jyellick
2016-06-20 14:00
Yes, I've thought the same for the `requeststore` thing I added, I know the performance must just be awful

jyellick
2016-06-20 14:00
As the slice sorting is being done via `reflect.DeepEqual` additionally

simon
2016-06-20 14:01
people already complained :slightly_smiling_face:

simon
2016-06-20 14:02
do we have to keep it sorted?

simon
2016-06-20 14:02
can't we sort it just when we need to?

jyellick
2016-06-20 14:02
We can

jyellick
2016-06-20 14:03
I thought it might be cheaper to try to keep it ordered than to order it before each access

simon
2016-06-20 14:03
why do we have to keep them ordered?

jyellick
2016-06-20 14:06
The behave tests tend to get upset if we do not, as they look for the last transaction they submitted to be committed as a signal that all have been committed. Additionally, it's the intuitive behavior.

simon
2016-06-20 14:07
well

jyellick
2016-06-20 14:07
And prevents a request from getting starved. Say if we go on map hash order, you could end up maintaining a huge queue of requests, and the requests which happen to land near the end of map iteration will constantly stay at the back of the queue and effectively never execute.

simon
2016-06-20 14:07
i see

simon
2016-06-20 14:07
but then a simple sequence would do

simon
2016-06-20 14:07
not ordered by time

simon
2016-06-20 14:07
just appended

jyellick
2016-06-20 14:09
Yes, I thought ordered by time would be ideal, but it's not strictly necessary.

simon
2016-06-20 14:09
okay

simon
2016-06-20 14:09
let me see what i can do

jyellick
2016-06-20 14:09
Really, I think if we used a better datastructure, like a tree, we could store by time efficiently

simon
2016-06-20 14:09
because this quadratic behavior is not good at all

simon
2016-06-20 14:09
we could

jyellick
2016-06-20 14:09
Or, even if we did insertion ordering on a linked list

simon
2016-06-20 14:09
yes

simon
2016-06-20 14:09
that's what i was going for

jyellick
2016-06-20 14:10
Would definitely be an improvement, I'm the first to admit that PR opted for 'clear correctness' in the face of terrible performance

simon
2016-06-20 14:13
:slightly_smiling_face:

simon
2016-06-20 14:13
but i see quite an overlap with the complainer reqstore stuff :slightly_smiling_face:

jyellick
2016-06-20 14:22
Yes, definitely so, in retrospect it might have been better to remove complaints, but not the complainer reqstore and re-use that. Sadly sometimes it takes nearly reimplementing something to understand the decisions made in it.

simon
2016-06-20 14:30
yep

simon
2016-06-20 14:30
see executor :slightly_smiling_face:

simon
2016-06-20 14:31
yey

simon
2016-06-20 14:31
O(1) restored

simon
2016-06-20 14:34
i think i made a blunder with my broadcast

simon
2016-06-20 14:34
we will queue and not drop messages as long as grpc does

simon
2016-06-20 14:35
plus grpc message reordering (unless they use a queueing mutex)


simon
2016-06-20 14:36
i'm about to walk to the train

jyellick
2016-06-20 14:41
Thanks @simon, appreciate the fix

simon
2016-06-20 14:43
sure

simon
2016-06-20 14:43
finally something not distributed :slightly_smiling_face:

jyellick
2016-06-20 15:59
@simon: I'm looking at `broadcast.go` and as best as I can tell, the message queue channels aren't ever read or written to? I think maybe this is what you were referring to on the scrum call today?

simon
2016-06-20 16:01
oh i didn't remove the channels?

simon
2016-06-20 16:01
yea

jyellick
2016-06-20 16:08
I think I'm seeing symptoms of arbitrary message ordering which is breaking busywork, currently putting together a changeset which doesn't spawn all the goroutines and utilizes those channels
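(A hedged sketch of the broadcaster shape under discussion: one queue and one sender goroutine per peer, so messages to a given peer leave in the order they were enqueued, and overflow is dropped rather than queued without bound. All names are illustrative, not the actual `broadcast.go` code.)

```go
// Sketch of a per-peer ordered sender: a single goroutine drains a bounded
// FIFO channel, so ordering is preserved (no goroutine-per-message races)
// and a slow peer causes drops instead of unbounded memory growth.
package main

import (
	"fmt"
	"sync"
)

type peerSender struct {
	queue chan string
	wg    sync.WaitGroup
}

// newPeerSender starts one goroutine that invokes send for each queued
// message in FIFO order. One goroutine per peer is what preserves ordering.
func newPeerSender(buf int, send func(string)) *peerSender {
	p := &peerSender{queue: make(chan string, buf)}
	p.wg.Add(1)
	go func() {
		defer p.wg.Done()
		for m := range p.queue {
			send(m)
		}
	}()
	return p
}

// enqueue drops the message when the queue is full instead of blocking;
// a BFT protocol must tolerate message loss anyway.
func (p *peerSender) enqueue(m string) bool {
	select {
	case p.queue <- m:
		return true
	default:
		return false // queue full: drop
	}
}

func (p *peerSender) close() {
	close(p.queue)
	p.wg.Wait()
}

func main() {
	var got []string
	p := newPeerSender(10, func(m string) { got = append(got, m) })
	p.enqueue("pre-prepare")
	p.enqueue("prepare")
	p.close()
	fmt.Println(got) // [pre-prepare prepare]
}
```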

jyellick
2016-06-20 16:15
https://github.com/hyperledger/fabric/pull/1927 @simon @kostas @tuand a pretty simple changeset if you guys could quickly review

jyellick
2016-06-20 18:03
@simon, I don't imagine you're around?

simon
2016-06-21 11:42
unfortunately my behave doesn't work well for 1874

simon
2016-06-21 11:42
i'm having problems with deploy

tuand
2016-06-21 12:26
gimme a few minutes ... doing a vagrant destroy/up ... will get the debug logs for the @issue_1874 tests

simon
2016-06-21 12:50
why can't i deploy chaincode? puzzling, puzzling

tuand
2016-06-21 13:03
what's the error on deploy ? is it a timeout again ?

simon
2016-06-21 13:10
no error

simon
2016-06-21 13:10
like it never finishes

simon
2016-06-21 13:10
but not even a timeout

simon
2016-06-21 13:10
ah now a timeout

simon
2016-06-21 13:11
2 minutes timeout

simon
2016-06-21 13:11
also there is no container running

simon
2016-06-21 13:11
is that some weird non-vagrant thing again?

tuand
2016-06-21 13:13
maybe ... i'm having to do vagrant destroy because when i tried make behave-deps earlier, i got "no permissions to install"

jeffgarratt
2016-06-21 13:33
@tuand, that may be an issue with the installation of grpcio package for python

simon
2016-06-21 13:35
yea i installed that

simon
2016-06-21 13:35
it's really that the chaincode subsystem can't start a container, it seems

jeffgarratt
2016-06-21 13:42
@simon do you notice a new image downloaded?

simon
2016-06-21 13:43
see in #

simon
2016-06-21 13:43
doesn't seem to be able to connect to the docker process

jeffgarratt
2016-06-21 13:44
wondering if a base image is being downloaded and the chaincode container is taking a while to build?

jeffgarratt
2016-06-21 13:44
does 'docker images' show new age on any of the base images?

simon
2016-06-21 13:49
no, nothing is being downloaded

simon
2016-06-21 14:29
jyellick: doesn't seem to be a consensus issue?

simon
2016-06-21 14:29
@jeffgarratt: could you have a look at 1874 - seems peers are not automatically reconnecting to each other?

jyellick
2016-06-21 14:30
@simon: Was putting together the final changes for 1928, haven't had a chance to look at the logs yet

simon
2016-06-21 14:30
i.e. if the rootnode bounces, it cannot join back (and probably other nodes can't join the network either)

simon
2016-06-21 14:30
okay

simon
2016-06-21 14:30
vp0 comes up, and nobody ever connects to it, nor does it connect to other nodes

simon
2016-06-21 14:30
i think that goes back to the lack of a fixed peer list we should have

jyellick
2016-06-21 14:32
Is this the 'root discovery node' thing?

simon
2016-06-21 14:34
i think this is it

simon
2016-06-21 14:35
we probably would have to persist a list of nodes/ips that we ever saw, and try to connect to any of those

simon
2016-06-21 14:35
but that's in the peer - consensus can't do anything about it

kostas
2016-06-21 14:38
I think there might be an easier fix

kostas
2016-06-21 14:41
have the rest of the nodes attempt to reconnect to the rootnode if they're not connected to it

jeffgarratt
2016-06-21 14:41
@simon seems to pass with latest master. I touched base with @lhaskins, she will contact me if any more issues

kostas
2016-06-21 14:41
so invoke this line (if the rootNode is not in your peersList) https://github.com/hyperledger/fabric/blob/master/core/peer/peer.go#L537

kostas
2016-06-21 14:42
fix, as in "stopgap measure"

simon
2016-06-21 14:42
jeffgarratt: well, it fails here with latest master

jeffgarratt
2016-06-21 14:42
hmmm

jeffgarratt
2016-06-21 14:42
there is a new ensureConnected function that runs in background

simon
2016-06-21 14:43
i don't know why

simon
2016-06-21 14:43
also imagine: vp0 stays down

simon
2016-06-21 14:43
how does vp3 connect back to vp1 and vp2?

jeffgarratt
2016-06-21 14:43
if they were NOT given a rootnode list, they would not, unless they retrieved it via discovery from the remaining peers

simon
2016-06-21 14:44
ah that logic doesn't work

simon
2016-06-21 14:44
see, the problem is that vp1 and vp2 are connected with each other

simon
2016-06-21 14:44
so len(peersMsg.Peers) > 0


simon
2016-06-21 14:45
so do we just have to give all peers a full rootnode list?

simon
2016-06-21 14:45
kostas: what if the rootnode is down?

simon
2016-06-21 14:45
how does vp3 connect to vp1 and vp2?

simon
2016-06-21 14:46
if the answer is, populate the rootnode list with all validators, fine

simon
2016-06-21 14:46
that's perfect

simon
2016-06-21 14:47
if this is something we can do

kostas
2016-06-21 14:47
I see your point, with vp0 down, if vp1 also goes down and comes back up, vp3 won't ever be able to reconnect to it

jeffgarratt
2016-06-21 14:47
correct

jeffgarratt
2016-06-21 14:47
we would need to add some sort of recently connected logic with retry

jeffgarratt
2016-06-21 14:47
which would be fairly simple to do

simon
2016-06-21 14:48
jeffgarratt: can we have more than 1 node in the root list?

jeffgarratt
2016-06-21 14:48
but the rootNode list would work for now

jeffgarratt
2016-06-21 14:48
yes

kostas
2016-06-21 14:48
it's an array, yes

simon
2016-06-21 14:48
and the peer tries to keep connections to all of them?

jeffgarratt
2016-06-21 14:48
no, only if totally lost conns

simon
2016-06-21 14:48
oh

simon
2016-06-21 14:48
that's bad

jeffgarratt
2016-06-21 14:48
could add that fairly easily

simon
2016-06-21 14:48
because it means that we can have a partitioned network

jeffgarratt
2016-06-21 14:49
we can definitely add more intelligence to the 'maintained' connections

simon
2016-06-21 14:49
i think if we can try to keep a connection to every node in the rootnodes, we're good

jeffgarratt
2016-06-21 14:49
say instead of Peers == 0, Peers < len(rootNodes)

simon
2016-06-21 14:49
yep

simon
2016-06-21 14:49
well

kostas
2016-06-21 14:49
yeah

simon
2016-06-21 14:49
and filter yourself from rootnodes

simon
2016-06-21 14:49
then all nodes can have the same config

jeffgarratt
2016-06-21 14:50
simple enough

simon
2016-06-21 14:50
okay, should i add to the issue that you're on it?

jeffgarratt
2016-06-21 14:51
sure thing

simon
2016-06-21 14:51
okay, this will have to go into release and master

simon
2016-06-21 14:52
thanks!

simon
2016-06-21 14:55
okay

jyellick
2016-06-21 15:03
@simon https://github.com/hyperledger/fabric/pull/1938 this incorporates the timeout you requested if you could take a look

jyellick
2016-06-21 15:06
@kostas @simon @tuand Even with the broadcast ordering fix (which helps), I'm still seeing the occasional duplicated request slip through under busywork. What is happening, with vp0 as primary and vp1 receiving a request, is: (1) vp1 broadcasts request A to the network; (2) vp0 and vp2 receive request A from vp1; (3) vp0 sends a pre-prepare to the network; (4) the network prepares and commits request A, including vp3; (5) finally, vp3 receives the broadcast of request A, which adds it to outstanding requests, and eventually gets it executed again

jyellick
2016-06-21 15:08
I asked @sheehan about querying the ledger for existing transaction UUIDs before ordering, but that is a DB hit, which means it may require disk IO, and is probably not something we want in our consensus path if we can avoid it.

jyellick
2016-06-21 15:09
I think it wouldn't be too difficult to hook the deduplicator code from @simon which is still sitting there, to try to squash this. I think it opens up the censorship window again, but not executing a transaction seems better than executing it twice.

jyellick
2016-06-21 15:11
This is something that I've only seen under the high load of busywork, and obviously requires some odd network latencies. What does everyone think about what approach should be taken, and whether this is critical for 0.5

kostas
2016-06-21 15:16
if we keep track of the last X (10? 50?) requests executed, would that mitigate the problem? and if the answer is yes, can we store these requests in a data structure without O(n) search times? (which I think was Simon's original concern)
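(A minimal sketch of kostas's suggestion: remember the last N executed request digests in a fixed-size ring buffer plus a map, so the memory is bounded and the membership test is O(1) rather than O(n). The names `recentlyExecuted` and the digest-string key are assumptions for illustration.)

```go
// Sketch of a bounded "recently executed" set: a ring buffer of the last N
// digests provides eviction order, and a map provides O(1) lookup. This
// mitigates, but does not fully eliminate, duplicate execution.
package main

import "fmt"

type recentlyExecuted struct {
	ring []string        // fixed-size buffer of the last N digests
	seen map[string]bool // O(1) membership
	next int             // next slot to overwrite
}

func newRecentlyExecuted(n int) *recentlyExecuted {
	return &recentlyExecuted{ring: make([]string, n), seen: make(map[string]bool, n)}
}

// record notes a digest as executed, evicting the oldest entry once full.
func (r *recentlyExecuted) record(digest string) {
	if old := r.ring[r.next]; old != "" {
		delete(r.seen, old)
	}
	r.ring[r.next] = digest
	r.seen[digest] = true
	r.next = (r.next + 1) % len(r.ring)
}

func (r *recentlyExecuted) executed(digest string) bool {
	return r.seen[digest]
}

func main() {
	r := newRecentlyExecuted(2)
	r.record("a")
	r.record("b")
	r.record("c") // evicts "a"
	fmt.Println(r.executed("a"), r.executed("c")) // false true
}
```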

simon
2016-06-21 15:22
who needs to deduplicate, that's the question

jyellick
2016-06-21 15:22
So, I did implement that list last night as a "will this work", and the short answer is "generally yes". The thing I dislike about it though, is that you're basically just making an unlikely event less likely, not completely eliminating it. Maybe that's good enough, especially as ultimately, it seems like the chaincode/crypto layer will need to defend against more malicious replay.

simon
2016-06-21 15:22
and i think the answer is: everybody, and during pre-prepare

simon
2016-06-21 15:23
and this is something that security needs to do

simon
2016-06-21 15:23
because these occasional duplications right now are because of our code

simon
2016-06-21 15:23
but any byzantine actor could resubmit requests at will

jyellick
2016-06-21 15:25
Right. We could certainly check the DB before accepting any request for ordering (either accepting a pre-prepare, or sending one), but obviously that slows us down, and as we talk of splitting the consensus network from the endorsers and actual ledger, that becomes even more expensive.

simon
2016-06-21 15:25
we really need to fuse batch and core

jyellick
2016-06-21 15:25
Agreed

simon
2016-06-21 15:26
how does pbft solve this?

jyellick
2016-06-21 15:26
I think PBFT assumes that broadcasts are atomic

jyellick
2016-06-21 15:27
Or, at least that messages arrive in the order in which they were sent

jyellick
2016-06-21 15:27
(across nodes)

simon
2016-06-21 15:28
so how come vp3 receives them in a different order?

simon
2016-06-21 15:28
the broadcaster issue?

jyellick
2016-06-21 15:29
Before the patch to the broadcaster, it was much more common

simon
2016-06-21 15:30
so why is it still possible at all?

jyellick
2016-06-21 15:30
The promise we get is that: "Messages sent by one node to another will arrive in the order they were sent", and my suspicion is that PBFT is assuming "A message sent in response to a broadcast, will arrive after the broadcast has been received by all nodes"

jyellick
2016-06-21 15:31
This is possible, because vp3 can commit a request without receiving any messages from vp1

simon
2016-06-21 15:31
well that is silly

simon
2016-06-21 15:31
that assumption

jyellick
2016-06-21 15:32
Maybe that assumption isn't really there, I am looking at the paper now

simon
2016-06-21 15:32
oh!

simon
2016-06-21 15:32
ha!

simon
2016-06-21 15:32
or not?

simon
2016-06-21 15:32
does the primary include the request in the pre-prepare?

simon
2016-06-21 15:33
i think it does

simon
2016-06-21 15:33
and we did that, because otherwise replicas would discard pre-prepares if they didn't have a matching request

jyellick
2016-06-21 15:34
It may be the timestamp deduplication

jyellick
2016-06-21 15:34
Ah, yes, that's true

simon
2016-06-21 15:34
so if we take that out

simon
2016-06-21 15:35
then vp3 would discard the pre-prepare, and immediately get left behind

simon
2016-06-21 15:35
that's silly

simon
2016-06-21 15:35
so we'd have to queue messages

jyellick
2016-06-21 15:35
Yes, we would be trading one problem for another

simon
2016-06-21 15:35
and possibly fetch the request

simon
2016-06-21 15:35
but it would reduce the amount of work the primary has to do

simon
2016-06-21 15:36
well, network IO

simon
2016-06-21 15:36
completely impossible to reason about this

jyellick
2016-06-21 15:36
But this doesn't really solve the duplication issue. I think that it's got to be the timestamping. PBFT wants the client to broadcast requests with incrementing timestamps

simon
2016-06-21 15:36
yes

simon
2016-06-21 15:37
but then request can get lost

jyellick
2016-06-21 15:37
Yes

simon
2016-06-21 15:37
if they are submitted concurrently

jyellick
2016-06-21 15:37
Exactly, or, the byzantine primary can always pick the highest timestamp to order first, censoring all previous requests

simon
2016-06-21 15:37
yes

jyellick
2016-06-21 15:40
"Additionally, replicas need to remember the 8-byte timestamp of the last request executed by each client to ensure exactly once semantics. But since timestamps are small and timestamps of inactive clients can be stored on disk, this should not cause a significant scalability problem."

jyellick
2016-06-21 15:40
I guess the idea is that clients are supposed to wait for a reply before submitting a new request

jyellick
2016-06-21 15:40
It seems like that would lower throughput, but we could certainly implement it as such.
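(The exactly-once rule quoted from the PBFT paper can be sketched as follows: each replica keeps the timestamp of the last request executed per client and rejects anything not strictly newer. The `dedup` type and string client IDs are illustrative assumptions.)

```go
// Sketch of PBFT's per-client timestamp deduplication: a request is only
// accepted if its timestamp is strictly greater than the last executed
// timestamp for that client, giving exactly-once semantics for clients
// that wait for a reply before resubmitting.
package main

import "fmt"

type dedup struct {
	lastExec map[string]uint64 // clientID -> timestamp of last executed request
}

// accept reports whether the request is fresh for this client.
func (d *dedup) accept(client string, ts uint64) bool {
	return ts > d.lastExec[client] // duplicate or stale requests are rejected
}

// markExecuted records the timestamp after the request executes.
func (d *dedup) markExecuted(client string, ts uint64) {
	if ts > d.lastExec[client] {
		d.lastExec[client] = ts
	}
}

func main() {
	d := &dedup{lastExec: make(map[string]uint64)}
	fmt.Println(d.accept("vp1", 5)) // true
	d.markExecuted("vp1", 5)
	fmt.Println(d.accept("vp1", 5)) // false: replay suppressed
}
```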

simon
2016-06-21 15:47
yea

simon
2016-06-21 15:47
that's what i had

simon
2016-06-21 15:47
ah, executed

simon
2016-06-21 15:48
so that would have to come from the ledger then

simon
2016-06-21 15:48
because on state transfer, consensus loses that information

jyellick
2016-06-21 15:48
Yes

simon
2016-06-21 15:48
so i think we should declare this a problem of some other part of the stack

simon
2016-06-21 15:48
which it is.

simon
2016-06-21 15:49
if somebody before consensus replays the transaction, we can't do anything about it

jyellick
2016-06-21 15:50
So I think the reply you'll hear, is that "we should not order transactions which are expected to fail"

jyellick
2016-06-21 15:52
But I think if we ever want to achieve truly high throughput, we really need to not perform a ton of introspection on requests. I would reply with "Who cares about the validity of what comes out of consensus, so long as everyone agrees on the contents and the order"

simon
2016-06-21 15:52
hm

jyellick
2016-06-21 16:00
@simon: Did you have a chance to review the broadcast changes?

simon
2016-06-22 13:49
jyellick: you around?

jyellick
2016-06-22 13:49
I am

simon
2016-06-22 13:51
i'm looking at some traces from #

simon
2016-06-22 13:52
first, we need to drop requests

simon
2016-06-22 13:52
people submit 1000s of requests per second...

jyellick
2016-06-22 13:52
Yes, completely agree

jyellick
2016-06-22 13:52
But we need a way to signal rejection

simon
2016-06-22 13:52
second, i see a request being re-queued

simon
2016-06-22 13:53
but i never see it being added in the first place

jyellick
2016-06-22 13:53
I'm not sure what you mean

simon
2016-06-22 13:53
how is that possible?

simon
2016-06-22 13:53
`11:00:01.440 [consensus/obcpbft] leaderProcReq -> DEBU 12e07a Batch primary 0 queueing new request qqlXTMDzm805n6gLNiVkjHKC2d/5Nji7qxZ3iuWV+fcwx9u/nRtoiJXqvI4RxSyxqAcihidLrbAILI9MrLWCKQ==`

simon
2016-06-22 13:54
but i never see this request being added to the outstanding request store before?

jyellick
2016-06-22 13:55
Yes, the ingress path is slightly different for the leader which we may want to fix.

simon
2016-06-22 13:55
`11:04:25.054 [consensus/obcpbft] startTimer -> DEBU 2c2fe3 Replica 0 starting new view timer for -1012918h58m43.28654848s: new view change`

simon
2016-06-22 13:55
LOL

jyellick
2016-06-22 13:55
Oops... that seems like a bug

simon
2016-06-22 13:56
i'm tempted to set K&L=1

simon
2016-06-22 13:56
and basically disable parallelism

simon
2016-06-22 13:56
and instead use batches

simon
2016-06-22 13:57
that thing probably has absorbed a million requests, but didn't advance past h=10

jyellick
2016-06-22 13:57
When the leader receives a request from a backup, it immediately adds it to the current batch, and shortly thereafter sends it off to PBFT. The leader doesn't add it to outstanding requests in this path, but probably should. The counter argument is that it's a nonbyzantine primary, so why bother tracking it, it has submitted it appropriately for ordering.

simon
2016-06-22 13:58
so that if something happens and there is a view change, it can complain about the new primary?

jyellick
2016-06-22 13:58
But, it will lead to warning messages that it is executing stuff not in its outstanding requests

simon
2016-06-22 13:58
i don't know

jyellick
2016-06-22 13:58
I think it's a bug, it should be fixed

simon
2016-06-22 13:59
ah hm, out of sequence numbers

simon
2016-06-22 13:59
could it be that we cannot include a null request sequence number on view change?

simon
2016-06-22 13:59
ah no, we don't have to include a null request

simon
2016-06-22 14:00
okay, we first need a way to reject this amount of requests

simon
2016-06-22 14:01
because the traces we get that way are completely useless

simon
2016-06-22 14:01
so i think we need to ration the number of requests any replica can have outstanding

jyellick
2016-06-22 14:01
Yes, I think so too

jyellick
2016-06-22 14:02
To do that we need to differentiate incoming transactions from consensus messages. The problem we have today is that it all comes through RecvMsg, and by the time our thread gets a chance to look at it, we've already accepted it.

simon
2016-06-22 14:02
yes

jyellick
2016-06-22 14:02
I think consensus messages, which will be dropped if they come in too fast via the buffered channel sitting in front of them, should continue to be accepted the way they are today.

simon
2016-06-22 14:03
well no, we could return an error in recvmsg

jyellick
2016-06-22 14:03
New incoming chain transactions should come in via another call which returns an error if there are more than X outstanding.

simon
2016-06-22 14:03
which will be passed back to the rest

jyellick
2016-06-22 14:03
We could. I dislike the idea of doing logic in RecvMsg though.

simon
2016-06-22 14:03
but yea, we should split the interface

simon
2016-06-22 14:03
like i did months ago

simon
2016-06-22 14:04
request and receive
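(The split ingress path being proposed could look like this sketch: consensus messages keep their existing drop-on-overflow path, while new chain transactions go through a separate submit call that immediately returns an error once more than X requests are outstanding. All names here are illustrative, not the real fabric interface.)

```go
// Sketch of a bounded transaction ingress: a buffered channel caps the
// number of outstanding requests, and Submit rejects rather than blocks
// when the cap is reached, so the REST layer can signal back-pressure.
package main

import (
	"errors"
	"fmt"
)

var errTooBusy = errors.New("too many outstanding requests, rejecting")

type ingress struct {
	outstanding chan []byte // capacity X = max outstanding requests
}

// Submit accepts a new transaction or rejects it immediately; the error is
// passed back to the caller instead of silently queueing the request.
func (in *ingress) Submit(req []byte) error {
	select {
	case in.outstanding <- req:
		return nil
	default:
		return errTooBusy
	}
}

func main() {
	in := &ingress{outstanding: make(chan []byte, 1)}
	fmt.Println(in.Submit([]byte("tx1"))) // <nil>
	fmt.Println(in.Submit([]byte("tx2"))) // too many outstanding requests, rejecting
}
```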

simon
2016-06-22 14:04
that won't protect us against a byzantine replica just blasting out requests

simon
2016-06-22 14:05
but there it is more difficult to protect against

jyellick
2016-06-22 14:05
Ah! So I had an idea about this that I was discussing a bit with @kostas

jyellick
2016-06-22 14:08
The original PBFT design calls for clients to submit a request, and wait for the result before submitting a new request. This fixes the censorship problem, but really poses a significant throughput problem per client. The problem is amplified in the fabric, because only replicas are clients, so if you followed that model strictly, you could only ever have up to n requests outstanding.

jyellick
2016-06-22 14:11
So, my thought was, this 'submit one and wait' model should be the goal, because it's been formally proven to be valid. What we could do though, is basically have up to 'm' slots per replica for outstanding requests, say, for instance 10. When vp1 receives a request, it first checks to see if it has any unoccupied request slots, if so, it picks one, and broadcasts the request to the network, indicating that this new request should be assigned to slot 2 (for instance). The network then stores this request as the outstanding request from vp1's slot 2, and waits for the request to be executed. Once executed, that slot is freed, and a new request may be submitted for that slot.

jyellick
2016-06-22 14:13
The primary cannot censor any request, because each slot is being monitored for censorship. You could essentially think of it like 'virtual clients'.

simon
2016-06-22 14:15
yes

jyellick
2016-06-22 14:15
If a replica sent a request for a slot that was already occupied, the slot can be safely overridden (assuming the timestamp has incremented), because the sending replica will only do this once it's confident that that request has been submitted for ordering.

simon
2016-06-22 14:15
do we only allow one outstanding request per virtual client?

jyellick
2016-06-22 14:15
Yes

simon
2016-06-22 14:15
aha

simon
2016-06-22 14:15
okay

simon
2016-06-22 14:15
so this is just for tracking

simon
2016-06-22 14:15
nice

simon
2016-06-22 14:16
so submission overrides

simon
2016-06-22 14:16
but that is equivalent to just giving a replica a list of 10 outstanding requests, no?

simon
2016-06-22 14:16
if it proposes a new one, the oldest one is discarded

jyellick
2016-06-22 14:16
Ah, not the oldest one

simon
2016-06-22 14:16
it doesn't prevent the massive submission of requests though

jyellick
2016-06-22 14:17
When it proposes a new one, by including the slot number, it indicates which old one should be discarded.

jyellick
2016-06-22 14:17
If you simply discard the oldest, then the primary may censor, by never ordering the oldest

jyellick
2016-06-22 14:20
A replica could send you a flood of requests, but it would essentially be censoring itself, because the requests would be overwritten before they had been ordered.
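(The 'virtual client' slot scheme described above can be sketched as follows: each replica gets m slots, each holding at most one outstanding request; a resubmission names the slot it replaces, so the network knows which old request to discard, and every slot can be monitored against censorship. Types and names are assumptions for illustration.)

```go
// Sketch of per-replica virtual-client slots: overwriting an occupied slot
// is only valid with a newer timestamp, which the sender uses only once it
// is confident the old request was already submitted for ordering.
package main

import "fmt"

type slotReq struct {
	digest    string
	timestamp uint64
}

type virtualClients struct {
	slots []slotReq // m slots for one replica; empty digest means free
}

// propose places a request in a slot, indicating which old one to discard.
func (v *virtualClients) propose(slot int, digest string, ts uint64) bool {
	if slot < 0 || slot >= len(v.slots) {
		return false
	}
	if cur := v.slots[slot]; cur.digest != "" && ts <= cur.timestamp {
		return false // stale overwrite attempt
	}
	v.slots[slot] = slotReq{digest: digest, timestamp: ts}
	return true
}

// executed frees the slot once its request commits, opening it for reuse.
func (v *virtualClients) executed(slot int, digest string) {
	if slot >= 0 && slot < len(v.slots) && v.slots[slot].digest == digest {
		v.slots[slot] = slotReq{}
	}
}

func main() {
	v := &virtualClients{slots: make([]slotReq, 2)}
	fmt.Println(v.propose(0, "reqA", 1)) // true
	fmt.Println(v.propose(0, "reqB", 1)) // false: slot occupied, timestamp not newer
	v.executed(0, "reqA")
	fmt.Println(v.propose(0, "reqB", 2)) // true: slot was freed
}
```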

simon
2016-06-22 14:22
well

simon
2016-06-22 14:22
what if it waits for a pre-prepare

simon
2016-06-22 14:22
and then sends again

jyellick
2016-06-22 14:23
I think it needs to wait for a prepare quorum cert before it can be confident that the request won't be censored

simon
2016-06-22 14:23
it could sort of game the system

simon
2016-06-22 14:23
but not too much

simon
2016-06-22 14:23
yes, but typically requests don't get censored

simon
2016-06-22 14:23
but if the number of outstanding requests is less than the batch size, it will have to wait for batch creation

jyellick
2016-06-22 14:25
I guess I don't really see 'sending too many requests' as abusive, so long as they are all getting processed. We want to make sure requests are not censored. I guess you could argue that one replica getting an unfair proportion of requests through is almost like censorship.

simon
2016-06-22 14:26
oh i don't care about censorship

jyellick
2016-06-22 14:26
If we wanted to get really clever, we could keep a counter of the number of pending requests per replica, and use that as a weight when picking which replica's request to order next. But I'm not really convinced that this is a problem yet.

simon
2016-06-22 14:26
right now

simon
2016-06-22 14:26
i care about limiting number of requests being ingressed by the system

jyellick
2016-06-22 14:27
I agree a peer could cheat a little by sending new requests once it sees its own in a pre-prepare, but I don't know that this really buys them much. Unless the outstanding size is greater than the batch size, it cannot use this to prevent other requests from executing.

simon
2016-06-22 14:32
yes, right now i'm just worried about system overload

jyellick
2016-06-22 14:32
Ah, okay. So, I think this helps to keep another replica from sending you a ton of requests and overloading the system, but obviously it does nothing to address something like the REST API from flooding you with a ton of requests.

simon
2016-06-22 14:32
yes

simon
2016-06-22 14:33
it's a two step thing

simon
2016-06-22 14:33
correct client rejecting extra requests

simon
2016-06-22 14:33
and system protecting against byzantine flooding peers

simon
2016-06-22 14:34
i'll head outside to catch some (rare) sun

simon
2016-06-22 14:34
after that i will integrate the telemetry stuff

simon
2016-06-22 14:34
so that we can start seeing drops and queue sizes

simon
2016-06-22 14:34
then we split the recvmsg path

jyellick
2016-06-22 14:35
Sounds good, I may not get around to it today, but I can work on the 'virtual client' stuff

jyellick
2016-06-22 14:35
Assuming you're on board

simon
2016-06-22 14:35
it's just a fixed array, no?

jyellick
2016-06-22 14:35
Yep, I don't anticipate it will be horribly complicated

simon
2016-06-22 14:35
okay

simon
2016-06-22 14:36
what do you think needs to be done for the release?

simon
2016-06-22 14:36
i don't think we can get this limiting done

jyellick
2016-06-22 14:36
For 0.5?

simon
2016-06-22 14:37
well, urgent stuff

jyellick
2016-06-22 14:38
There were the three PRs I submitted yesterday, adding back in the deduplication, the view change resend, and the broadcaster ordering. The peer bouncing reconnect stuff I think @kostas has a handle on. Those were the only big outstanding things I was aware of. Certainly we need to rate limit incoming transactions, you can kill the system just by submitting too many transactions, but I don't think that fits.

tuand
2016-06-22 14:40
so for 0.5, fixes for #1874, bishop's #1857

simon
2016-06-22 14:41
74 is handled by jeff and kostas, 57 - jason?

tuand
2016-06-22 14:42
1857 jason's PRs, yes

tuand
2016-06-22 14:43
plus if scottz and co. find any regressions but that would be case by case

jyellick
2016-06-22 14:43
As best as I can tell 1857 is fixed. The caveat is that sometimes, there will be a failure if null requests are not enabled (and not a true failure by pbft definitions, just by people's expectations)

simon
2016-06-22 14:44
great

brendan
2016-06-22 19:42
has joined #fabric-consensus-dev

kostas
2016-06-22 22:12
just catching up on this thread -- I'm a fan of the virtual clients idea

dianfu
2016-06-22 23:42
has joined #fabric-consensus-dev

yingfeng
2016-06-23 08:32
This is the pipelining support in Raft, could pbft be modified to support pipelining as well such that the throughput could be improved? ``` Pipelining is also well-supported by Raft. The AppendEntries consistency check guarantees that pipelining is safe; in fact, the leader can safely send entries in any order. To support pipelining, the leader treats the next index for each follower optimistically; it updates the next index to send immediately after sending the previous entry, rather than waiting for the previous entry's acknowledgment. This allows another RPC to pipeline the next entry behind the previous one. Bookkeeping is a bit more involved if RPCs fail. If an RPC times out, the leader must decrement its next index back to its original value to retry. If the AppendEntries consistency check fails, the leader may decrement the next index even further to retry sending the prior entry, or it may wait for that prior entry to be acknowledged and then try again. Even with this change, LogCabin's original threading architecture still prevented pipelining because it could only support one RPC per follower; thus, we changed it to spawn multiple threads per peer instead of just one. ```

simon
2016-06-23 10:45
yingfeng: pbft uses pipelining

simon
2016-06-23 10:46
yingfeng: thoughput is limited by chaincode execution, not by pbft

zuowang
2016-06-23 10:54
simon: do you know why the chaincode execution should be done one after another? Could you cite any paper for me to get a full understanding? Thank you very much!

yingfeng
2016-06-23 11:06
simon: do you mean chaincode execution is the bottleneck? then why does a single peer node without p2p have a much higher tps? i think it is because the sequential execution leads to blocking between two batches

simon
2016-06-23 11:21
it does?

kostas
2016-06-23 16:38
Going back to the conversation about locally generated requests versus external ones and the issue of fairness (let's set aside pacing for now). If I'm parsing the path correctly, a request generated locally via the REST API will essentially take a shortcut to the consenter's `RecvMsg()` method. The external request, on the contrary, gets there via the goroutine that services the gRPC connection with the submitting replica, which adds it to the handler's `consenterChan`, where another goroutine will pick it up and add it to the consenter's `out` channel, where the goroutine that was spawned in the `GetEngine` constructor will pick it up and pass it on to `RecvMsg()`. At any rate, in `RecvMsg()`, both the REST thread (that serves the locally generated request) and the goroutine spawned in `GetEngine` (serving the external request) try to add their respective requests to the unbuffered `events` channel of the manager in `obcpbft` for processing.

kostas
2016-06-23 16:40
_If_ the paths are right, can we make any claims on lack of fairness? I don't quite see it.

jyellick
2016-06-23 16:53
I don't think there is a fairness problem there. The fairness problem would arise only if we split the `RecvMsg` into `RecvConsensus` and `RecvLocalTran` and prioritized the `RecvConsensus` over `RecvLocalTran`. But, if instead we make them equal priority when there is room in our outstanding reqs, and otherwise ignore `RecvLocalTran` I think we're fine.

kostas
2016-06-23 16:57
Why would we ignore `RecvLocalTran()` when `outstandingReqs` is full, and not keep them at the same priority?

jyellick
2016-06-23 17:05
It would be when `localOutstandingReqs` is full, so that we can provide backpressure to the invoker. Because it is only based on our own requests, we cannot be spammed by the network into ignoring local invokes.

kostas
2016-06-23 17:08
I do not follow, and that is because you're switching to a different model and I don't think we have the same set of assumptions in mind.

kostas
2016-06-23 17:09
Let's discuss how the `localOutstandingReqs` structure will be used for instance.

kostas
2016-06-23 17:10
A locally generated request gets routed to `RecvLocalTran`, and is then placed to a finite-size `localOutstandingReqs`.

jyellick
2016-06-23 17:11
If there is no room in that structure, that call blocks until there is

jyellick
2016-06-23 17:11
(Or, blocks and times out, or returns an error on block, whatever defined behavior we want)

kostas
2016-06-23 17:16
If there is room, the request is added there, sure. Do you then assume a separate goroutine that attempts to pick it up and pass it on to the manager's queue?

jyellick
2016-06-23 17:23
I think that would probably be the easiest way to implement it.

jyellick
2016-06-23 17:24
Additional go routines should not be problematic so long as they do not manipulate or reference PBFT internal state

kostas
2016-06-23 17:25
> But, if instead we make them equal priority when there is room in our outstanding reqs, and otherwise ignore `RecvLocalTran`

clessor
2016-06-23 17:28
has joined #fabric-consensus-dev

kostas
2016-06-23 17:29
So it's not that you ignore `RecvLocalTran`. Both the goroutine that it spawns, and the goroutine that is spawned via `RecvConsensus` always get the same treatment when it comes to adding to the manager's queue. (Whatever treatment the Go scheduler gives them, that is.) Instead, it may happen that the `RecvLocalTran` call will drop the transaction that it brings along, if the `localOutstandingReqs` store is full.

jyellick
2016-06-23 17:34
As I think about this, I might eliminate the delivery go routine, and instead structure it as:
1. REST API go routine enters `RecvLocalTran`
2. It attempts to queue onto a buffered new tran channel of configurable size
3. If successful, it then pushes a `newTranEvent` into the event manager
* If the PBFT thread has a free virtual client spot, it reads a tran from the new tran channel
* If the PBFT thread has no free virtual client spot, it will attempt to read from the new tran channel when one frees up

jyellick
2016-06-23 17:34
Of course, if there is not room in the buffered channel, then we can do configurable behavior, either block until a timer expires, block indefinitely, or reject immediately

kostas
2016-06-23 17:38
What is the problem that you're trying to solve?

kostas
2016-06-23 17:40
Remember that my focus on this thread is on fairness --or lack thereof--, and the conclusion is that there is no unfairness in the existing mechanism. I take it we're switching to flow control?

jyellick
2016-06-23 17:47
Yes, this is to address flow control, not fairness

kostas
2016-06-23 17:48
`git stash; git checkout flow-control`

jyellick
2016-06-23 17:48
The system is 'fair' today. The suggestion to prioritize consensus messages over local trans, as a mechanism of flow control, exposes the possible lack of fairness. The proposal above adds flow control while maintaining fairness.

kostas
2016-06-23 17:57
That works for me. The proposal makes sense.

sheehan
2016-06-23 19:39
@jyellick: I’m assuming https://github.com/hyperledger/fabric/pull/1928 is still in progress. Is that correct?

jyellick
2016-06-23 19:40
I think it's good to go... the exact same changeset was already merged into 0.5 I believe

jyellick
2016-06-23 19:42
@sheehan: ^

sheehan
2016-06-23 19:43
ok, so that dangling conversation is taken care of?

tuand
2016-06-23 19:44
@sheehan, PR #1971 should be labeled as 0.5

jyellick
2016-06-23 19:48
@sheehan: Yes, that PR was paired with one that was merged into 0.5 that the conversation continued on, I added a reply to that effect.

sheehan
2016-06-23 19:48
thanks!

jyellick
2016-06-23 19:49
Thank you

tuand
2016-06-23 20:08
@jyellick: can you take a look at #1942 ? behave test is in one of the comments ... i thought nullrequest=1s cleared it but it's behaving differently today. also changing k=10 also seems to trigger something ... looks like some number of requests are being duplicated

jyellick
2016-06-23 20:09
Sure, I will take a look, got a few other things pending, so may be a few minutes, but I will get to it

tuand
2016-06-23 20:10
danke !

sheehan
2016-06-23 20:25
@jyellick: Just had build fail on the master branch after https://github.com/hyperledger/fabric/pull/1928


jyellick
2016-06-23 20:25
Looking...


jyellick
2016-06-23 20:27
Running the tests locally off my branch to see if I can reproduce

sheehan
2016-06-23 20:28
It wasn’t there when Travis ran against the PR

jyellick
2016-06-23 20:28
They pass in my local non-merged branch

jyellick
2016-06-23 20:28
Fetching upstream to try again

sheehan
2016-06-23 20:29
It looks like that test case already passed in the next master branch build https://travis-ci.org/hyperledger/fabric/builds/139861848 so the error is not consistent

jyellick
2016-06-23 20:31
Ah, I see the bug

jyellick
2016-06-23 20:32
This should definitely be fixed, I imagine this could occur in production

sheehan
2016-06-23 20:32
you want me to open an issue?

jyellick
2016-06-23 20:32
Actually... this may be less serious than I thought, but obviously should be fixed

jyellick
2016-06-23 20:33
Yeah, this could cause a crash at startup

jyellick
2016-06-23 20:34
Easy verifiable fix, I can turn it around in 10 minutes


jyellick
2016-06-23 20:40
Thanks



jyellick
2016-06-23 20:50
@sheehan: Those are for master and v05 respectively

jyellick
2016-06-23 20:54
And, my apologies, despite what I had just told you about 1928, when I tried to cherry pick the commit in, I noticed that apparently one of the broadcaster commits that made it into 0.5 did not make it into the 1928 PR, so, I just added it back in to 1979.

sheehan
2016-06-23 20:56
np

jyellick
2016-06-23 21:11
@tuand: Does one of the 1942 behave tests reproduce the issue?

tuand
2016-06-23 21:13
leticia provided a behave test in #1942

tuand
2016-06-23 21:14
`lhaskins` comment timestamped "2 days ago"

tuand
2016-06-23 21:16
yesterday, this behave test worked with core_pbft_general_timeout_nullrequests=1s , today, it's been inconsistent. I also tried K = 2 or 10, sometimes when K=2 it'll work

jyellick
2016-06-23 21:55
@tuand: I may see what's going on here, verifying

yingfeng
2016-06-24 07:56
One of major engineering optimization of `raft` is to execute message sending and persistence in parallel. This could be seen from section `10.2.1` of [Ongaro's thesis](https://ramcloud.stanford.edu/~ongaro/thesis.pdf) . During the code inspection, I've found there exists many data persistences during each step of consensus, including `recvRequest`, `sendPrePrepare`,`sendPrepare`,`maybeSendcommit`. Could the persistence be refactored like the design of `raft` to improve the overall throughput?

cca
2016-06-24 09:14
@yingfeng: Remember we want to tolerate "byzantine" faults. Raft tolerates crashes. It is an interesting problem to find more efficient BFT protocols than PBFT, but there is no BFT version of Raft. (Yes I am aware there is a term paper with a protocol sketch titled BFT-Raft somewhere, but we are interested in protocols whose correctness is widely accepted.)

yingfeng
2016-06-24 09:38
@cca: I mentioned Raft here because I wonder if its existing engineering optimization could be applied to PBFT, since current throughput is far from satisfying for our production usage.

simon
2016-06-24 10:25
yingfeng: please show me data that pbft is the performance bottleneck

simon
2016-06-24 10:26
yingfeng: and we do message sending and execution in parallel.

yingfeng
2016-06-24 10:39
@simon our requirement is nearly 10k per peer node, do you think the current pbft is enough to reach that aim?

simon
2016-06-24 10:39
no idea

simon
2016-06-24 10:40
but the chaincode execution cannot

simon
2016-06-24 10:40
the absolute maximum i could observe was 800tx/sec

simon
2016-06-24 10:40
with a null chaincode

simon
2016-06-24 10:40
with example02 i could get 400 i think

simon
2016-06-24 10:40
so no

simon
2016-06-24 10:40
you won't get 10k.

simon
2016-06-24 10:41
unless you hack the system and not use chaincode containers, etc.

yingfeng
2016-06-24 10:54
I also only get 700 for example02 using a single peer. weeks ago this value was 7000

simon
2016-06-24 11:06
no

simon
2016-06-24 11:06
never

simon
2016-06-24 11:06
you measured it wrong

simon
2016-06-24 11:12
you need to measure closed loop


simon
2016-06-24 11:12
try this branch

simon
2016-06-24 11:12
and use ?wait=20s on your invoke URL

simon
2016-06-24 11:13
then you will see the true performance

vukolic
2016-06-24 12:29
@yingfeng - this is why we will be separating execution from consensus


vukolic
2016-06-24 12:44
while allowing chaincode to have an easy way to do parallel execution

vukolic
2016-06-24 12:45
however, if chaincode does not (or cannot) leverage this - bottleneck of a single chaincode will *almost always* be execution

vukolic
2016-06-24 12:45
yet, we do not want a single chaincode execution to be blocking others, hence we are moving away from the monolithic design

vukolic
2016-06-24 12:46
in principle PBFT with small number of nodes, on a cluster will never be a bottleneck for a single chaincode

simon
2016-06-24 13:53
so what's the next thing for me to work on?

simon
2016-06-24 13:53
split request and message input

jyellick
2016-06-24 13:59
I think that's certainly something that needs to be done

sheehan
2016-06-24 14:01
looking to merge 1951, 1971, and 1980 to the 0.5 branch today. They all look good to go, but please let me know if I should wait on any of these.

simon
2016-06-24 14:06
sheehan: i have a suggestion for simplifying release management

simon
2016-06-24 14:06
i'd merge only to the release branch, and occasionally merge release to master

simon
2016-06-24 14:07
and for next time i suggest feature freeze and first bug fixes only before branching off release

jyellick
2016-06-24 14:07
@sheehan: Please also look at 1987, as the associated issue #1942 has been tagged for inclusion in 0.5

sheehan
2016-06-24 14:25
@simon: yes, agree about feature freeze. Need that next time. I don’t think it was communicated very well this time and I don’t think the amount of outstanding consensus work was understood

sheehan
2016-06-24 14:25
They were discussing a more formal process in the TSC call yesterday

simon
2016-06-24 14:26
well the problem is that we didn't start testing well ahead of time

simon
2016-06-24 14:27
for a month it looked like there were no more bugs

simon
2016-06-24 14:37
can we remove noops?

jyellick
2016-06-24 14:38
At the very least it would be nice to prevent noops with N > 1

simon
2016-06-24 14:39
but why even

simon
2016-06-24 14:39
pbft with N=1 F=0 works fine

jyellick
2016-06-24 14:39
I'd be fine with aliasing noops to be n=1, f=0

jyellick
2016-06-24 14:40
There may be some desire to leave multiple plugins in tree, that noops could be a good starting place for someone writing a new consensus plugin

jyellick
2016-06-24 14:41
I don't know that that's actually true. And I think it's important that we remove the direct ledger execution stuff from the consensus API, because it is not serialized against state transfer (and running both concurrently has been shown to cause panics and crashes)

simon
2016-06-24 14:45
ah i see

simon
2016-06-24 14:51
so if i use a channel, i will have to introduce knowledge of transactions into the event manager

simon
2016-06-24 14:51
or i need to create a new goroutine

jyellick
2016-06-24 14:52
Can you elaborate?

jyellick
2016-06-24 14:52
Did you see my proposal to @kostas yesterday?

simon
2016-06-24 14:53
oh i see

kostas
2016-06-24 14:53
Right that seems like a good way to do it.

simon
2016-06-24 14:53
have the engine not enqueue a transaction, but a transaction event

jyellick
2016-06-24 14:53
Exactly, then the event thread can go read off the channel if it decides to

simon
2016-06-24 14:54
but that introduces knowledge of the event manager to that interface

simon
2016-06-24 14:54
need a piping goroutine that just translates the queue contents

jyellick
2016-06-24 14:55
The way things are structured now, is that the `external.go` is the only file which go routines not belonging to the event manager should enter. Although it's not the case today, I also think that's the only place that should have a reference to the event manager. [`external.go` purposefully does not have a reference to the PBFT structures, to discourage any methods in them from accessing them directly, their interaction is through the event manager, which does the serializing]

jyellick
2016-06-24 14:56
That `external.go` is where the `RecvMsg` lives today, and it puts message events into the manager

jyellick
2016-06-24 14:58
If `RecvLocalTran` (or whatever better name) lives there with a buffered channel, it could queue that `newTranEvent` into the event manager. Then the PBFT internals processes that event, checks if it can handle a new tran right now, and then goes and reads off the buffered tran chan (which I assume would live in `external.go`)
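
A rough sketch of that handoff, assuming the structure described above; `RecvLocalTran` is the proposed entry point, but `consumer`, `newTranEvent`, and the rest are hypothetical names for illustration. The external goroutine stashes the payload in a buffered channel and only a lightweight event crosses into the manager's queue; the serialized PBFT thread drains the payload channel when it decides it has room:

```go
package main

import "fmt"

// newTranEvent is a lightweight signal: "a local tran is waiting".
type newTranEvent struct{}

type consumer struct {
	tranChan chan string      // buffered; holds the actual payloads
	events   chan interface{} // event manager queue (signal only)
}

// RecvLocalTran is what an external.go-style entry point could look
// like: it never touches PBFT internals, only the two channels.
func (c *consumer) RecvLocalTran(tx string) bool {
	select {
	case c.tranChan <- tx:
		c.events <- newTranEvent{}
		return true
	default:
		return false // buffer full: backpressure to the invoker
	}
}

// processEvent stands in for the single PBFT event-loop thread: it
// only reads the payload channel when it has a free client slot.
func (c *consumer) processEvent(e interface{}, haveSlot bool) (string, bool) {
	if _, ok := e.(newTranEvent); ok && haveSlot {
		return <-c.tranChan, true
	}
	return "", false
}

func main() {
	c := &consumer{tranChan: make(chan string, 4), events: make(chan interface{}, 4)}
	c.RecvLocalTran("deploy-A")
	tx, ok := c.processEvent(<-c.events, true)
	fmt.Println(tx, ok)
}
```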

jyellick
2016-06-24 14:58
I'm not seeing the "introduces knowledge of the event manager to that interface"

simon
2016-06-24 15:03
well, enqueing a "newTranEvent" is introducing knowledge of the event manager

jyellick
2016-06-24 15:03
I guess I would not say 'introducing', because `RecvMsg` already has knowledge

simon
2016-06-24 15:04
i mean the engine function

jyellick
2016-06-24 15:04
(as does every other call in `external.go`)

jyellick
2016-06-24 15:04
Oh, then I don't see it again

jyellick
2016-06-24 15:04
Oh, why not have the engine call into `RecvLocalTran` in `external.go`

simon
2016-06-24 15:04
i am trying to export a chan

simon
2016-06-24 15:04
so that the engine can decide what to do with it

jyellick
2016-06-24 15:05
Ah, I see, you want the decision over blocking/timing out/rejecting to be at the engine level, and not at the PBFT level?

simon
2016-06-24 15:06
i don't think pbft should make that call

simon
2016-06-24 15:07
otoh, we could just make it a blocking call

simon
2016-06-24 15:07
and if somebody wants to drop messages, they can introduce a queue ahead of that?

simon
2016-06-24 15:07
or whatever other, maybe more fair data structure

jyellick
2016-06-24 15:08
Hmmm

simon
2016-06-24 15:08
yea, that also means that we don't need a channel

simon
2016-06-24 15:08
i like that better

jyellick
2016-06-24 15:08
I do agree, this is better configured in the engine in PBFT

jyellick
2016-06-24 15:09
Then `RecvLocalTran` always blocks until PBFT reads it. And, the engine should never call it in parallel?

simon
2016-06-24 15:09
well it can

jyellick
2016-06-24 15:09
But it would lose ordering promises I would think?

simon
2016-06-24 15:09
but if it wants to establish some fairness, it will have to implement that a different way

simon
2016-06-24 15:10
yes

simon
2016-06-24 15:10
also this matches better a possible RPC API

jyellick
2016-06-24 15:10
What does the implementation of making that call block until PBFT is ready to receive a tran look like?

simon
2016-06-24 15:11
for now, just enqueue into the normal event queue

simon
2016-06-24 15:12
later, have something that uses condition variables? counting semaphore, something

jyellick
2016-06-24 15:12
But then we are in the same situation as we are now? Not being able to defer receiving new transactions?

simon
2016-06-24 15:12
yes, i want to first transform the interface, then transform behavior

jyellick
2016-06-24 15:12
Ah okay

kostas
2016-06-24 15:20
What engine function do you refer to specifically?

simon
2016-06-24 15:25
the one that calls recvmsg

jyellick
2016-06-24 15:32
@sheehan: 1987 CI finished successfully

sheehan
2016-06-24 15:35
thanks, merged


jyellick
2016-06-24 17:36
When I see the duplicated deploy, they are both going into the same block, which I find a little odd/interesting

simon
2016-06-24 17:47
haha wat

jyellick
2016-06-24 17:47
Think I've got it... think it may be from not taking the currentExec into account

jyellick
2016-06-24 17:50
And, it's only on deploy, because deploys take forever

jyellick
2016-06-24 17:51
Or not, the fact that they're in the same block makes me think this isn't view change related

kostas
2016-06-24 17:52

kostas
2016-06-24 17:52
@simon, @tuand: How many `REST invoking chaincode...` statements would you expect to see in the logs generated from this test?

tuand
2016-06-24 17:54
in vp0's log, 9 invokes no ?

kostas
2016-06-24 17:55
Yet we see 10. And to make matters even more interesting, we see two chaincode invocations when the two nodes are down.

tuand
2016-06-24 17:59
9 invokes from the behave log


kostas
2016-06-24 18:00
If you try on my branch, can you tell me how many you get?

tuand
2016-06-24 18:06
on jyellick/issue-1942 with kchristidis/fix-184 , i see 9 invokes in vp0 log


jyellick
2016-06-24 18:15
Aha! 90% sure I've got it, checking now

kostas
2016-06-24 18:15

kostas
2016-06-24 18:15
Checking your logs now.

jyellick
2016-06-24 18:19
We do not clear the batch store on view change

jyellick
2016-06-24 18:19
So if we cycle 4 views, we look at our outstanding requests, see something, and so add it to the batchstore, but that 'something' is already in the batch store, so we get the duplicated deploys

simon
2016-06-24 18:20
there was so much churn recently

simon
2016-06-24 18:20
not surprised

simon
2016-06-24 18:21
i think for the next iteration we should not try to meet a deadline - it just makes it more likely that bugs slip through

jyellick
2016-06-24 18:22
Yes, I don't think deadlines make for great code. That's supposed to be the whole point of agile... push what's ready

simon
2016-06-24 18:23
and we need to have continuous testing

simon
2016-06-24 18:23
right now we have waterfall testing

jyellick
2016-06-24 18:25
And yet another bug... our request timer, and our batch timer are set to equal values.... if you fire exactly one request into the system, it will generally never execute, because we will view change before the batch expires

jyellick
2016-06-24 18:27
Clearing the batch store on viewchange, and dropping the batch timeout to be 1 second fix the behavior for busywork. I could instead increase the request timeout, what do you guys think? (@simon @tuand @kostas)

tuand
2016-06-24 18:38
increase request timeout ... less code changes ?

jyellick
2016-06-24 18:38
Code changes are needed regardless

jyellick
2016-06-24 18:39
It's "decrease batch timeout" or "increase request timeout"

jyellick
2016-06-24 18:39
If they are the same value, as they are today, then a single transaction will never execute

tuand
2016-06-24 18:42
i'd still go with increase request timeout ... might even help us if we run again into a huge transaction that takes too long to broadcast

jyellick
2016-06-24 18:44
Any other votes? @kostas @simon

kostas
2016-06-24 18:44
No particular preference here.

jyellick
2016-06-24 19:41
@tuand: I'm thinking it needs to be the batch timeout, not the request timeout. In particular, if people want to turn on null requests, in order to have the outstanding request timer work with today's code, the null requests must come less frequently than the request timeout, so, increasing the timeout means increasing the minimum value of the null requests, which seems problematic
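
For illustration, the relationship being discussed, keyed like the obcpbft timeout section of config.yaml; the values here are examples only, not recommendations:

```yaml
general:
  timeout:
    batch: 1s        # keep strictly below the request timeout, or a lone
    request: 2s      # request view-changes before the batch timer fires
    nullrequest: 0s  # 0 disables; if enabled, keep above the request timeout
```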

tuand
2016-06-24 19:46
agreed ... should add a warning in config.yaml to have diff values for the timeouts ? don't cross the streams :grin:

jyellick
2016-06-24 19:56
@tuand Would you mind adding that to https://github.com/hyperledger/fabric/pull/2007 ? I will respond with an update

tuand
2016-06-24 20:06
done.

jyellick
2016-06-24 20:16
Thanks, and pushed the fix

scottz
2016-06-24 20:50
@kostas: I retested 1942 using 9f5666f. Code looks much improved: no duplicates and all three peers sync up correctly - BUT querying the bounced peer (before and after stopping a 4th peer) still produces the initially deployed values (inaccurate responses) for at least a minute or two, even though it has since become a functioning member of the 3-peer consensus network.

jyellick
2016-06-24 20:53
@scottz I commented on the issue, this is "working as designed" from a PBFT perspective

jyellick
2016-06-24 20:57
In the future we may want to make some optimizations to help the rejoined peer recover faster, but if you wish to know a definitive point in time value, you must perform this as an invocation which is ordered by the network and wait for the result of that invocation. This is commonly referred to as a 'strong read'

kostas
2016-06-24 20:58
@scottz: This is also a point we bring up the in the BMX documentation and a common source of confusion. Jason has covered it nicely in the Github issue.

scottz
2016-06-24 21:03
Sure I will read more. Intuitively, I can accept that at the time when the peer joins the network when there are already 3 nodes working and reaching consensus without it. But when one of them drops, and that restarted node continues onwards as one of the remaining three AND there is consensus on subsequent transactions, then I would think at that time then it would have to be caught up in sync (and should give responses same as the other two peers would give). How can that NOT be true?

jyellick
2016-06-24 21:08
When the peer rejoins the network, it knows the state the network was last in, and that was "ordering requests 1 through 8". So, it gets a message from the primary saying "Let's all agree to put request A in position 3", and the replica says "Okay, that's between 1 and 8, I've got nothing in slot 3, that's fine with me", and then the primary says "Let's all agree to put request B in position 4", and likewise the new peer says "Okay, that's between 1 and 8, and I've got nothing in slot 4, that's fine with me" and so on. The old nodes, they've already agreed on what goes in positions 1 and 2, so they executed them, so when the position is agreed on for 3 through 8, they simply execute them. The recently restarted one doesn't know what goes in position 1 or 2, so, it keeps a log of what goes into 3 through 8, but, it can't execute anything yet, because it must execute in sequence.

jyellick
2016-06-24 21:11
Now, eventually, the primary says "Let's all agree to put request H in slot 9", and the restarted replica says "Nope, we're ordering requests 1 through 8 now, and I still don't have requests 1 or 2, maybe I missed those requests, or maybe the primary is being a jerk, either way, I'm not going to order this new request". So, now the network doesn't have enough nodes to make progress, and, this triggers a view change, where everyone agrees on a starting point, and a new leader. Once they agree on the starting point, the restarted replica realizes it doesn't have that starting point, so it performs state transfer, and because it's been listening from the beginning of this new starting point, it won't miss any transactions, and will be able to execute up to date once state transfer finishes.
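
The acceptance rule being described (sequence number inside the watermark window, slot still free) boils down to a check like this; `inWatermarks` is a hypothetical helper for illustration, not the actual obcpbft code:

```go
package main

import "fmt"

// inWatermarks reports whether a replica would agree to order a
// request at sequence number seq, given its low watermark h (last
// stable checkpoint) and log window size L. The slot must also be
// unclaimed, else the primary is proposing a conflicting assignment.
func inWatermarks(seq, h, L uint64, slotTaken bool) bool {
	return seq > h && seq <= h+L && !slotTaken
}

func main() {
	// Restarted replica: last known checkpoint h=0, window of 8.
	fmt.Println(inWatermarks(3, 0, 8, false)) // slot 3: "that's fine with me"
	fmt.Println(inWatermarks(9, 0, 8, false)) // slot 9: outside window, refuse
}
```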

jyellick
2016-06-24 21:14
So, the TL;DR version is, the restarted peer will help the network, and some new transactions may execute, but eventually the restarted peer gets in sync

kostas
2016-06-24 21:15
I'll get sth up in the Wiki as well.

scottz
2016-06-24 21:22
so they "help" the network with a "yes" vote but they don't really know if it makes sense with previous state. So if I was a peer who wanted to take over the world, I could zap and restart all the other peers, and while they were restarting I would advance my transactions and they would all blindly vote as zombies for awhile after they recover and then they would initiate their own state transfers eventually all matching mine. MuuuaHaHa.

tuand
2016-06-24 21:24
well, the zap and restart part might be a wee bit complicated :wink:

jyellick
2016-06-24 21:24
@scottz: The PBFT ordering is agnostic to content. We want to make sure no one is getting censored, but beyond that, it is really about everyone agreeing on a total ordering

jyellick
2016-06-24 21:25
The danger in something like bitcoin is that the blockchain forks and you end up being able to spend the same coin twice

jyellick
2016-06-24 21:26
So long as everyone agrees you give the coin to person A, and then you submit a transaction to give the coin to person B, there's really no problem

jyellick
2016-06-24 21:26
Because everyone agrees on the order, the second transaction will not execute successfully

jyellick
2016-06-24 21:27
In the same sense, in the bitcoin network, you could control 100% of the mining nodes, but you still could not falsify a transaction

jyellick
2016-06-24 21:27
Consensus is about getting everyone a consistent global ordering, and then the chaincode/ledger is what determines whether a transaction is 'valid'

scottz
2016-06-24 21:33
@tuand: If I was an evil genius, I would have a big laser zapper. OK, So, Back to my specific case: what you are saying is I cannot query a node that has restarted until it sync's, but I don't know how/when/why that happens. How big is the queue of ordered transactions (1 through 8, in your example)? Can I expect it to sync up after a certain number of transactions after recovering, or in lieu of that maybe after a certain timer pops after 1 minute? How can I write a reliable repeatable maintainable predictable deterministic test case for this?

jyellick
2016-06-24 21:35
If you want to know the ledger's point-in-time state, you need to submit an invoke transaction whose output contains your desired value. Then wait for it to appear in the ledger, and you will know the value in the network at the point in time that transaction executed.

scottz
2016-06-24 21:36
or rather, " a certain number of transactions, plus the time it takes to complete the viewchange and also for that node to do a state transfer"

jyellick
2016-06-24 21:36
In the future, we hope to add a 'strong read' API which will be like a query, but go through consensus, unlike a normal query, which executes without ordering

jyellick
2016-06-24 21:38
There is a checkpoint interval and log size multiplier in config.yaml

jyellick
2016-06-24 21:38
If you multiply those together, you will get the number of PBFT sequence numbers that may execute while not being up to date

jyellick
2016-06-24 21:38
Multiply this by the batch size for an upper bound on the number of transactions

scottz
2016-06-24 21:42
right now in my view, that would be 80. K: 10, logmultiplier: 4, batchsize: 2.
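A minimal sketch of the arithmetic jyellick describes, using scottz's config values (the variable names here are just illustrative stand-ins for the config.yaml keys):

```go
package main

import "fmt"

func main() {
	k := 10            // checkpoint interval (K in config.yaml)
	logMultiplier := 4 // log size multiplier
	batchSize := 2     // transactions per batch

	// Number of PBFT sequence numbers that may execute while a
	// replica is not up to date.
	seqNos := k * logMultiplier

	// Upper bound on the number of transactions.
	txBound := seqNos * batchSize

	fmt.Println(seqNos, txBound) // 40 80
}
```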

scottz
2016-06-24 21:50
OK, so if I am a user, and I query and it looks like I have $100, I cannot believe it (using the type of Query that we have implemented today). And if I try to withdraw it, you are saying that I won't get anything from my peer (or ledger) until enough transactions occur for my peer to sync with the ledger. So if my peer who told me I had $100 was recently restarted, it could have been wrong, so I just have to accept that my next request (withdrawal) might simply be rejected, depending on whatever happened while it was restarting. But overall if the other peers knew I had only $9, then the right result would occur (i.e. my $100 withdrawal request would be rejected, because I didn't actually have $100).

scottz
2016-06-24 21:51
But after that, once I know a peer is in sync (40 transactions), my test scripts can believe and depend on any query results, right?

scottz
2016-06-24 21:57
However... at any point in time, a client today would never really know if the peer it is querying was just restarted (and thus possibly out of sync) or rather if it was in sync (and therefore supplying reliable data). Hmmmm...

scottz
2016-06-24 21:59
I really appreciate your explanations! Very helpful.

jyellick
2016-06-24 22:04
@scottz Your transactions should always validate their inputs. Think about it like writing a check, you may check your account balance online and it says $100, but you wrote 12 checks for $50 each, that haven't been cashed yet. There's nothing stopping you from writing another check. The nice thing about the blockchain is that your check is now associated with a complete transaction like "transfer $50 in exchange for X", so it executes atomically, if you don't have the funds to back it up, then the transaction won't occur (unlike with a real check, where these things are detached)

jyellick
2016-06-24 22:05
Even discounting the restarted peer, you can't guarantee that between your query and your transaction that something hasn't changed. Your queries should always return data that was "right at some point", but may not be "right when it was sent", and certainly not "right when it was received"

jyellick
2016-06-24 22:06
I'd love to see the API expanded to include a timestamp about the data, that "This was your balance at XXXX time"

jyellick
2016-06-24 22:06
And I know there is some work pending in this area.

scottz
2016-06-24 22:35
yes, and if there were any transactions pending at that point in time too (so I could determine if my last n transactions are still in queue or not, so I know how to interpret the info I receive back). Or, to know when a given transaction is processed and entered into the ledger, not just received. (Tuan explained to me there is an event notification system being planned too.) Then I could check my checkbook transactions and determine if the query response makes sense for that point in time.

gennady.laventman
2016-06-26 08:41
has joined #fabric-consensus-dev

simon
2016-06-27 13:02
jyellick: you around?

jyellick
2016-06-27 13:20
Yep

jyellick
2016-06-27 13:20
What's up?

jyellick
2016-06-27 13:20
^@simon

simon
2016-06-27 13:20
hi

simon
2016-06-27 13:21
trying to implement that sub-client registry

simon
2016-06-27 13:21
wondering how to hook it into the manager

jyellick
2016-06-27 13:21
Ah, thought you were doing the Consensus/Transaction split

simon
2016-06-27 13:22
i did already

jyellick
2016-06-27 13:22
Including the flow control?


simon
2016-06-27 13:22
no, for flow control i need the structure that allows me to do flow control

simon
2016-06-27 13:23
i.e. a structure that will allow me to take a new transaction

jyellick
2016-06-27 13:24
I'm not sure what you mean

simon
2016-06-27 13:31
i need to make RecvRequest() unblock when a previous request has executed, and now a new slot is free

simon
2016-06-27 13:32
so i need a structure that keeps record of these slots, and a way to communicate to the RecvRequest routine that it can continue

simon
2016-06-27 13:32
i guess recvrequest can read from a channel

simon
2016-06-27 13:32
and whenever a request is executing, we write to that channel

simon
2016-06-27 13:33
and we just use the context of that recvrequest routine

simon
2016-06-27 13:33
that would work

simon
2016-06-27 13:34
@jyellick: did you want to implement the sub-client stuff?

simon
2016-06-27 13:34
then i'll just mock up a small counting thing

kostas
2016-06-27 13:35
(counting what?)

simon
2016-06-27 13:40
outstanding requests

jyellick
2016-06-27 13:54
@simon: If you'd like to implement that, it's fine, there is plenty of work to go around, just thought it was on my plate

simon
2016-06-27 13:54
ah no

simon
2016-06-27 13:54
please go ahead

simon
2016-06-27 13:54
you must have something in mind

simon
2016-06-27 13:55
i started with this

simon
2016-06-27 13:55
```
type clientTxStore struct {
	Ready      chan int
	txToClient map[string]int
	freeClient map[int]struct{}
}
```

simon
2016-06-27 13:55
but not sure it is the right choice

simon
2016-06-27 13:55
and i don't know how to put the subclient into the request

jyellick
2016-06-27 14:00
I assumed we would need to modify the Request message format

kostas
2016-06-27 14:01
so that we include the slot?

jyellick
2016-06-27 14:01
Right

jyellick
2016-06-27 14:03
Another thing I've considered, what would everyone think about allowing the PrePrepare messages to include a repeated Request section

jyellick
2016-06-27 14:03
The marshaling and unmarshaling back and forth between batch and core seems inefficient and unnecessary

jyellick
2016-06-27 14:04
(Especially as they merge)

simon
2016-06-27 14:04
yes

simon
2016-06-27 14:04
i thought about that too

kostas
2016-06-27 14:04
I agree

simon
2016-06-27 14:04
do we now work on master or on release?

kostas
2016-06-27 14:04
master

simon
2016-06-27 14:04
i think there should be a merge of release into master

simon
2016-06-27 14:05
because there are commits in release that are not in master, no?

jyellick
2016-06-27 14:05
I think that would have been the clean way to do things, but we should have frozen master and did not

simon
2016-06-27 14:05
well yes

simon
2016-06-27 14:05
not out of a release playbook

jyellick
2016-06-27 14:05
I believe I have rebased / cherry-picked commits in both

simon
2016-06-27 14:05
who came up with that release process?

jyellick
2016-06-27 14:05
(although also a number of outstanding PRs)

jyellick
2016-06-27 14:06
Not really sure...

simon
2016-06-27 14:06
i thought the linux foundation knew about these things

simon
2016-06-27 14:06
doesn't seem like it, honestly

simon
2016-06-27 14:06
i've done releases with dragonfly before - it's not that complicated

simon
2016-06-27 14:06
so everything is in master?

simon
2016-06-27 14:07
because so far i've been working on release

simon
2016-06-27 14:07
given that more stuff went into release than into master

jyellick
2016-06-27 14:07
I believe new features should be being built on master, but there's a big PR backlog against master

simon
2016-06-27 14:08
sure, but i'm not going to build new stuff against master unless master contains all the bug fixes, etc.

jyellick
2016-06-27 14:17
I've pushed a parallel PR to master for every PR that's gone to release

jyellick
2016-06-27 14:18
(Granted, 4 of them are outstanding, though one of those is also not yet in release)

kostas
2016-06-27 14:21
This is being discussed in the technical planning call right now by the way.

simon
2016-06-27 14:22
nope, there are still some not committed

simon
2016-06-27 14:22
i have no invitation to any technical planing call

jyellick
2016-06-27 14:25
Forwarded to you

simon
2016-06-27 14:27
i guess now is too late

jyellick
2016-06-27 14:28
They're still discussing to some extent... think you could still speak up

simon
2016-06-27 14:29
i don't even know the swiss dial in number - if they're interested in my input, they can ask me directly

jyellick
2016-06-27 14:31
Ah, heard someone connect, had assumed it was you

simon
2016-06-27 14:31
nope

kostas
2016-06-27 14:33
You can join the meeting online. (Had to disconnect as I had to join another mtg so I don't know if it's still on, but keep it in mind for next time.)

cca
2016-06-27 14:34
the webex doesnt run on my linux, i need a phone #...

simon
2016-06-27 14:35
same here

jyellick
2016-06-27 14:35
@cca I had the same problem... https://github.com/fgsch/docker-webex

jyellick
2016-06-27 14:35
Works surprisingly well

kostas
2016-06-27 14:35
Ah, I see.

jyellick
2016-06-27 14:36
(Though I've not tried to do audio through it, just looking at the screen share)

simon
2016-06-27 14:36
oh my

jyellick
2016-06-27 14:36
Rather than litter my system with obsolete random 32 bit libraries... docker seemed like a good option

simon
2016-06-27 14:39
should that request counting thing live in batch or in externaleventreceiver?

jyellick
2016-06-27 14:41
As I envisioned it, there would be some sort of queue in externaleventreceiver, which would preserve ordering (rather than just having a bunch of different go routines waiting on a channel, for instance)

jyellick
2016-06-27 14:42
Then, batch would remove the first item from the queue whenever it has room to do so

jyellick
2016-06-27 14:43
I think we need to decide how the engine side is going to work first

jyellick
2016-06-27 14:43
If the engine side only ever sends in one transaction at a time, then blocking on an unbuffered channel would be the way to do it, waiting for batch to come read from it

simon
2016-06-27 14:44
so i think if we block in RecvRequest, we can just inject a new event when we unblock

jyellick
2016-06-27 14:44
But what causes us to unblock?

simon
2016-06-27 14:44
execute writes to a chan

jyellick
2016-06-27 14:47
I'm not sure that I like that. I would prefer a simple flag which indicates whether a request is pending or not (this would be set to true whenever a `newBlockedRequestEvent` came in, and cleared once read). If it is, then when we get an execution done event, or maybe a state transferred event, or whatever, then the event manager thread would simply go read from an unbuffered `pendingRequest` channel in `externalEventReceiver`

jyellick
2016-06-27 14:49
I suppose it would be easy enough to have an `unblockOneRequest` call, and do it that way instead, which would be invoked in the execute path, or wherever else.

simon
2016-06-27 14:52
i don't think i understand

jyellick
2016-06-27 14:54
So, the proposed ingress path would be: 1. Enter `RecvRequest`, pbft `pendingRequest` is false 2. Send the manager a `newBlockedRequestEvent`, once processed `pendingRequest` is true 3. Block on an unbuffered channel, waiting for the manager thread to read the request from this channel 4. Unblock as PBFT reads from this channel, `pendingRequest` is now false
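A hedged sketch of the four-step ingress path jyellick outlines (names like `newBlockedRequestEvent` and the `pending` channel are from this discussion, not the actual fabric code): a goroutine entering `RecvRequest` first notifies the manager, then blocks on an unbuffered channel until the manager thread reads the request.

```go
package main

import "fmt"

type request struct{ payload string }

// newBlockedRequestEvent tells PBFT a request is waiting (pendingRequest = true).
type newBlockedRequestEvent struct{}

type receiver struct {
	events  chan interface{} // manager's event queue
	pending chan request     // unbuffered hand-off channel
}

func (r *receiver) RecvRequest(req request) {
	r.events <- newBlockedRequestEvent{} // step 2: announce the pending request
	r.pending <- req                     // step 3: block until the manager reads
}

func main() {
	r := &receiver{events: make(chan interface{}, 1), pending: make(chan request)}
	go r.RecvRequest(request{"tx1"})

	<-r.events         // manager processes the event, pendingRequest is now true
	req := <-r.pending // step 4: manager reads, unblocking RecvRequest
	fmt.Println(req.payload)
}
```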

simon
2016-06-27 14:54
also, how does the sub-client id get communicated?

jyellick
2016-06-27 14:54
So, `RecvRequest` is only for locally generated trans, no? So we can assign any free virtual client ID

simon
2016-06-27 14:55
yes

simon
2016-06-27 14:55
i don't understand your path, sorry

jyellick
2016-06-27 14:55
Maybe I can clarify

simon
2016-06-27 14:56
so you want to introduce knowledge of two channels to the manager?

jyellick
2016-06-27 14:58
Not to the manager, no. The manager would have no knowledge of the other channel. The second channel would live in `external.go` and PBFT would maintain state, as to whether a transaction is waiting on that channel or not. Whenever a go routine arrived with a tran, it would send an event via the event manager, to tell PBFT to update its state that a new transaction is waiting. When PBFT is able to absorb a new transaction, it checks to see if a tran is waiting, and if so, goes and reads off the channel in `external.go`

simon
2016-06-27 14:59
hmm

jyellick
2016-06-27 15:02
So, I like your model because the ingress of the trans is through the event manager

jyellick
2016-06-27 15:03
The thing I don't like about it, is that PBFT doesn't know whether or not there's an outstanding tran until it arrives. Maybe that's not a problem

jyellick
2016-06-27 15:12
I think I've talked myself out of the ingress model I proposed, I'm back on board with how you were planning to do it, basically a toggle to block incoming transactions we flip on once our client slots are full, and flip off once there is room.

simon
2016-06-27 16:30
#1171 is the issue we're working on

jyellick
2016-06-27 16:47
Yep

jyellick
2016-06-27 16:55
@kostas What is your status for merging batch/core?

kostas
2016-06-27 16:56
On whitelisting now, batch/core comes next. As we discussed last week, we can spread the tasks. Are you about to start on this?

jyellick
2016-06-27 16:58
I'm trying to figure out exactly how to fit the work on

jyellick
2016-06-27 16:59
As I look at this slotting stuff, trying to decide if I should implement it in batch or core, or, on top of some merged batch/core.

kostas
2016-06-27 17:00
I was actually wondering whether that would go into batch or core.

kostas
2016-06-27 17:00
(If it goes in the merged batch/core, the dilemma goes away :simple_smile: )

jyellick
2016-06-27 17:01
Exactly. The right place I think is in core, but, it doesn't fit there today, so it could be done in batch for now, but that means it's one more thing to merge into core.

kostas
2016-06-27 17:01
Shall we tackle the unification now?

kostas
2016-06-27 17:01
I can set the whitelisting aside for a few days.

jyellick
2016-06-27 17:02
Up to you, but if you are willing, I think unification sooner is better than later

kostas
2016-06-27 17:03
Agreed; I'm in.

jyellick
2016-06-27 17:03
I am on site in RTP today if you are, could come chat physically if you think that would be faster

kostas
2016-06-27 17:04
Let's do that and then we can document the process in an issue on Github.

jyellick
2016-06-27 17:05
Sounds good

kostas
2016-06-27 19:39
So if we are to remove `sieve` and all `obc-classic` references we should eventually be looking at a `consensus` package that looks roughly like this: ```
consensus.go
controller package
executor package
helper package
noops package
pbft package
util package
```

kostas
2016-06-27 19:39
Where the `util` package contains the `events` package.

kostas
2016-06-27 19:40
And within the `pbft` package all `obc-*.go` files are replaced by the `pbft.go` + `pbft_test.go` pair.

jyellick
2016-06-27 19:48
Still running CI, but here is the initial PR to remove Sieve and its references: https://github.com/hyperledger/fabric/pull/2030

hgabor
2016-06-27 20:18
Hi

hgabor
2016-06-27 20:19
Which is the part of the code that is executed right after a consensus is made, e.g. pbft or noops (as far as that is a consensus)? If I'm right it is exectxs in helper

jyellick
2016-06-27 20:23
@hgabor Ultimately, to create a block first call `BeginTxBatch` followed by `ExecTxs` followed by `CommitTxBatch`. All of the consensus plugins follow this pattern in some fashion.

jyellick
2016-06-27 20:24
Be aware that PBFT now uses the `executor` package to perform this task, in order to serialize executions and state transfer, as calling the methods referenced above while in state transfer can cause a panic. Noops never triggers state transfer, so its usage is safe.

hgabor
2016-06-27 20:24
Is it possible that obcpbft batch calls exectxs more than once in a row?

jyellick
2016-06-27 20:25
It should not be today, no.

hgabor
2016-06-27 20:26
I was 'playing' with it and it did so. I made some changes maybe that caused the problematic operation

jyellick
2016-06-27 20:28
Ah, yes. obcpbft batch uses `executor.go`, you can take a look there. It should be okay to invoke execute multiple times, but the code path is usually: invoke `Execute` (which invokes `BeginTxBatch` if necessary, then `ExecTxs`, and calls back), which then invokes `Commit` (which invokes `CommitTxBatch`)

jyellick
2016-06-27 20:29
However, be careful to wait until the callback has been made before invoking the second `Execute`, if the callback is still pending, you will likely deadlock that node's consensus.
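A minimal sketch of the Execute/callback/Commit sequencing jyellick describes (the `ledger` and `executor` types and their method bodies here are simplified stand-ins, not fabric's actual API):

```go
package main

import "fmt"

type ledger struct{ batchOpen bool }

func (l *ledger) BeginTxBatch()        { l.batchOpen = true }
func (l *ledger) ExecTxs(txs []string) { fmt.Println("executed", len(txs), "txs") }
func (l *ledger) CommitTxBatch()       { l.batchOpen = false; fmt.Println("committed") }

type executor struct{ l *ledger }

// Execute mirrors the described path: BeginTxBatch if necessary,
// then ExecTxs, then the callback to the consumer.
func (e *executor) Execute(txs []string, executedCallback func()) {
	if !e.l.batchOpen {
		e.l.BeginTxBatch()
	}
	e.l.ExecTxs(txs)
	executedCallback()
}

func (e *executor) Commit() { e.l.CommitTxBatch() }

func main() {
	e := &executor{l: &ledger{}}
	// Per the warning above: wait for the callback before issuing a
	// second Execute, otherwise the node's consensus may deadlock.
	e.Execute([]string{"tx1", "tx2"}, func() { e.Commit() })
}
```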

hgabor
2016-06-27 20:29
It seemed to log a critical message saying that Replica %d is missing request for seqNo=%d with digest '%s' for assigned prepare after fetching, this indicates a serious bug

jyellick
2016-06-27 20:29
Which commit level of the code are you at? There was a benign scenario which would spew that message erroneously.

hgabor
2016-06-27 20:30
I think I was at committed

hgabor
2016-06-27 20:30
But not sure

hgabor
2016-06-27 20:32
When can that message be spewed?

hgabor
2016-06-27 20:32
In what scenario exactly


jyellick
2016-06-27 20:35
You can check to make sure your code has that second check for a digest of ""

jyellick
2016-06-27 20:36
Essentially, on view change, if there are null requests included in the Xset, then we will not have a corresponding request in our request store, and that message will be displayed. This is benign because null requests are a sort of pseudo-request which we never expect to be in our request store. When this error message was added, that possibility was not taken into account.

hgabor
2016-06-27 20:37
Oh I see

hgabor
2016-06-27 20:38
And let me ask more about exectxs and pbft batch. If you have only 1 TX, I mean the client has sent only one TX, is it possible that exectxs gets called more than once?

jyellick
2016-06-27 20:40
No, it should not be. There was a bug, fixed in PR #2007 / #2008 (for 0.5 / master respectively), which would sometimes cause transaction duplication, especially deploys.

jyellick
2016-06-27 20:40
(your question makes me think you might be observing that behavior)

hgabor
2016-06-27 20:43
I am doing my experiments with a week old master. May that bug affect that?

kostas
2016-06-27 20:43
(^^ Yes.)

jyellick
2016-06-27 20:43
Definitely, that was only merged into 0.5 earlier today, and is still not merged into master.

jyellick
2016-06-27 20:43
I strongly recommend that you do your experimentation with the current 0.5 branch.

hgabor
2016-06-27 20:44
BTW aren't you talking about the "system chain code deployment bug"? You mentioned deploy

jyellick
2016-06-27 20:44
There was a significant amount of churn and bug fixing in obcpbft over the last week.

jyellick
2016-06-27 20:44
I'm not certain, do you have an issue number?

hgabor
2016-06-27 20:45
Ok I will switch to that :-)

hgabor
2016-06-27 20:45
No but I m trying to find one


jyellick
2016-06-27 20:47
^ This is an issue which reported the symptoms, though the fix was already in queue

hgabor
2016-06-27 20:51
I may have something similar but with invoke txs

jyellick
2016-06-27 20:51
Yes, it can happen with invokes as well as deploys

jyellick
2016-06-27 20:51
Especially if you only send a single one

jyellick
2016-06-27 20:51
Then you will end up with a duplicated invoke tx

jyellick
2016-06-27 20:52
It just so happens that in most of the test, people send a single deploy tx, wait for it to finish, and then send many invokes.

hgabor
2016-06-27 20:52
I will try the 0.5 branch and inform you what happened and whether I had that problem again

jyellick
2016-06-27 20:52
Great, good luck

hgabor
2016-06-27 20:55
Thanks for the help, I will check out the consensus modules again and maybe I will have new questions :-P

jyellick
2016-06-27 20:55
You're welcome, happy to help, we'll be here.

hgabor
2016-06-27 20:55
@hgabor pinned a message to this channel.

jyellick
2016-06-27 20:56
@kostas @tuand @simon If you guys have a chance, could you review and sign off on 1976 and 2030?

jeffprestes
2016-06-27 21:30
has joined #fabric-consensus-dev

simon
2016-06-28 12:54
hah, i just ran into an issue where with N=1, F=0, i stopped the replica (probably while some requests were in flight), and now the replica won't process requests anymore

simon
2016-06-28 12:55
because lastExec is 4867, and seqNo is 4872

simon
2016-06-28 12:55
maybe a view change would fix it?

simon
2016-06-28 13:30
if somebody could look at the split-request-ingress branch

simon
2016-06-28 13:30
@jyellick, @kostas: you've been talking about the design

simon
2016-06-28 13:30
having it in the external event receiver is ugly

jyellick
2016-06-28 13:31
@simon: Sure, I can take a look

kostas
2016-06-28 13:31
looking


simon
2016-06-28 13:31
so i'd appreciate some ideas how to improve that

simon
2016-06-28 13:31
but it seems to be working

simon
2016-06-28 13:32
meaning, it is closed loop now

simon
2016-06-28 13:32
yey

simon
2016-06-28 13:32
without my hack

jyellick
2016-06-28 13:36
@simon: I discussed this some with @kostas, I could not figure out why you wanted to assign the ID in `external.go` instead of simply sending in the transaction

simon
2016-06-28 13:37
it was the first data structure that came to my mind

simon
2016-06-28 13:37
i'm fine with whatever

jyellick
2016-06-28 13:37
Okay, then I think `external.go` can be cleaned up pretty simply

simon
2016-06-28 13:37
the nice thing now is that we can reject duplicate transactions :slightly_smiling_face:

simon
2016-06-28 13:37
which is more a protection of the data structure than anything else

jyellick
2016-06-28 13:41
@simon @kostas @tuand Anyone have a chance to look at https://github.com/hyperledger/fabric/pull/2030 ?

jyellick
2016-06-28 13:43
(Actually, saw your remark @tuand we can post to slack before merge, but as we just forked off the release, no one seems to be too worried about potential breakage, if anyone wants something stable to play with, they should use the dev preview, as master is likely to be in serious flux)

tuand
2016-06-28 13:45
np ... good point about advertising on slack in advance

simon
2016-06-28 13:49
oh man, closed loop

simon
2016-06-28 13:49
so fantastic

jyellick
2016-06-28 13:59
@simon: Is it safe for me to base my work from your branch?

simon
2016-06-28 14:00
i think we should remove the closed loop hack first

simon
2016-06-28 14:00
which is in the middle of the telemetry branch

simon
2016-06-28 14:01
but you can also take the two commits and rebase them onto anything you want

simon
2016-06-28 14:01
either way

jyellick
2016-06-28 14:01
Alright

simon
2016-06-28 14:03
i'll remove the hack and push again

simon
2016-06-28 14:04
ok

jyellick
2016-06-28 14:05
Just cherry picked those commits onto a fork of the sieveless branch, hopefully good enough to work from there

simon
2016-06-28 14:09
should i merge your sieveless branch?

kostas
2016-06-28 14:09
(Hope the answer is yes, I already have)

simon
2016-06-28 14:09
okay

simon
2016-06-28 14:09
i can rebase on top of the sieveless branch

jyellick
2016-06-28 14:10
If you guys vote your approval on the PR, we can probably get it pulled into master today

kostas
2016-06-28 14:10
I don't think it's an issue of lack of approvals, but a matter of Sheehan & co. being backlogged

simon
2016-06-28 14:12
jyellick: maybe you can merge your sieveless branch into my branch, that way we all have the same history

jyellick
2016-06-28 14:13
Sure, or just push your branch and I can rebase onto it

simon
2016-06-28 14:13
i did

jyellick
2016-06-28 14:13
simon/split-request-ingress ?

simon
2016-06-28 14:14
yep

jyellick
2016-06-28 14:14
Thanks

simon
2016-06-28 14:15
so what is missing is function doc and tests

simon
2016-06-28 14:15
but i guess you'll change that stuff anyways

jyellick
2016-06-28 14:16
The reqqueue stuff?

simon
2016-06-28 14:16
yea

jyellick
2016-06-28 14:17
That's the plan

kostas
2016-06-28 14:18
as for simplifying `RecvRequest` further, do we agree that this is as simple as it can go? ```
func (eer *externalEventReceiver) RecvRequest(tx *pb.Transaction) error {
	<-eer.reqQueue.GetReady()
	eer.manager.Queue() <- transactionEvent{tx}
	return nil
}
```

kostas
2016-06-28 14:19
(and is that less ugly?)

simon
2016-06-28 14:19
`WaitReady()` maybe

kostas
2016-06-28 14:19
right, that seems like a better method name

kostas
2016-06-28 14:21
(and I guess the thinking is that you call the reqQueue's `Register` from the manager's `ProcessEvent`)

jyellick
2016-06-28 14:22
I'd rather get the `reqQueue` out of `external.go` entirely, have very simple channel logic in `external.go`, and then deal with the more complicated stuff on the other side. ```
func (eer *externalEventReceiver) RecvRequest(tx *pb.Transaction) error {
	eer.manager.Queue() <- eer.createTxEvent(tx)
	return nil
}
```

jyellick
2016-06-28 14:23
Where here `createTxEvent` waits for some channel to unblock

simon
2016-06-28 14:23
but createTxEvent is still part of external.go?

kostas
2016-06-28 14:23
where that channel is the reqQueue's Ready channel I presume?

kostas
2016-06-28 14:24
if that's the case, where do you hold the reqQueue?

jyellick
2016-06-28 14:25
Yes, `createTxEvent` is still part of `external.go` but there's just a simple counter that PBFT can hit saying "I've got a slot available", and that call (`createTxEvent`) blocks unless that counter is greater than 0 (implemented as a buffered channel)

jyellick
2016-06-28 14:26
There would be no queueing on the `external.go` side. If you want correct ordering, call it serially.

simon
2016-06-28 14:26
where is the difference?

jyellick
2016-06-28 14:26
Purely that the queue state is managed inside of PBFT, rather than `external.go`

simon
2016-06-28 14:26
so createTxEvent() would just do <- somechannel

jyellick
2016-06-28 14:26
More or less

simon
2016-06-28 14:27
how do you envision counting and unblocking?

simon
2016-06-28 14:28
it still would be a buffered chan of the same size?

jyellick
2016-06-28 14:28
Yes

simon
2016-06-28 14:28
okay

kostas
2016-06-28 14:28
That does look like a cleaner approach.
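A hedged sketch of the buffered-channel counter jyellick describes (`flowControl`, `createTxEvent`, and `slotFreed` are illustrative names from this discussion, not fabric's actual API): PBFT "hits the counter" by returning a token when a slot frees up, and `createTxEvent` blocks until a token is available, so ingress stalls once all client slots are full.

```go
package main

import "fmt"

type flowControl struct {
	slots chan struct{} // buffered; capacity = number of client slots
}

func newFlowControl(n int) *flowControl {
	fc := &flowControl{slots: make(chan struct{}, n)}
	for i := 0; i < n; i++ {
		fc.slots <- struct{}{} // all slots start free
	}
	return fc
}

// createTxEvent blocks unless the counter is greater than 0,
// i.e. unless a client slot is free.
func (fc *flowControl) createTxEvent(tx string) string {
	<-fc.slots // take a slot; blocks when none are free
	return "txEvent:" + tx
}

// slotFreed would be invoked on the execute path (or wherever else)
// once a request has finished.
func (fc *flowControl) slotFreed() {
	fc.slots <- struct{}{}
}

func main() {
	fc := newFlowControl(2)
	fmt.Println(fc.createTxEvent("a"))
	fmt.Println(fc.createTxEvent("b"))
	fc.slotFreed() // execution finished; one slot free again
	fmt.Println(fc.createTxEvent("c"))
}
```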

simon
2016-06-28 14:29
aside: i want to start using panic() when a function is called incorrectly

simon
2016-06-28 14:29
not about network data being incorrect, but about a function being called with incorrect data/state

jyellick
2016-06-28 14:30
It would make the code cleaner

tuand
2016-06-28 14:33
send an event before calling panic() ?

simon
2016-06-28 14:33
what event?

simon
2016-06-28 14:34
panic is just to abort the code because there is a clear programming bug
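A small sketch of the convention simon proposes (the function names here are made up for illustration): panic on caller bugs in our own code, but return an error for bad network data, which is expected in a byzantine setting.

```go
package main

import "fmt"

// setSeqNo panics on a negative sequence number: that can only come
// from our own code, i.e. a clear programming bug, so abort loudly.
func setSeqNo(n int) {
	if n < 0 {
		panic(fmt.Sprintf("setSeqNo called with negative seqNo %d", n))
	}
	// ... normal processing ...
}

// handleNetworkMessage returns an error for malformed input: bad data
// from the network is not a programming bug, so reject it and keep running.
func handleNetworkMessage(payload []byte) error {
	if len(payload) == 0 {
		return fmt.Errorf("empty payload")
	}
	return nil
}

func main() {
	setSeqNo(42)
	if err := handleNetworkMessage(nil); err != nil {
		fmt.Println("rejected:", err)
	}
}
```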

tuand
2016-06-28 14:35
using our event framework ... whoever's monitoring the network might not be sitting at a console

simon
2016-06-28 14:35
whoever is running the network better have a pager connected to when a node crashes

tuand
2016-06-28 14:37
event would also automatically log reason for crash

simon
2016-06-28 14:38
events log something?

tuand
2016-06-28 14:39
i would hope that there is a listener that is monitoring events from all peers

jyellick
2016-06-28 15:19
An announcement for anyone who is still using the "sieve" or "classic" PBFT variants. There is a pending PR ( https://github.com/hyperledger/fabric/pull/2030 ) which removes the "sieve" and "classic" PBFT variants from the fabric master branch. Unless there is some new opposition, hopefully this PR will be merged later today. Anyone relying on these PBFT variants in their own scripts or automation should modify them to use the "batch" PBFT mode. ^ Was about to send this to the fabric mailing list, unless anyone has any suggestions?

simon
2016-06-28 15:19
go for it

kostas
2016-06-28 15:19
ship it

simon
2016-06-28 15:19
and then we also remove noops

kostas
2016-06-28 15:20
and drop the `obc` prefix from the entire package

simon
2016-06-28 15:20
oh you and your OCD :slightly_smiling_face:

kostas
2016-06-28 15:20
I've been getting better at it, but still, a long way to go admittedly.

jyellick
2016-06-28 15:28
Do we have a plan for ensuring that all replicas have the same config? There are some obvious misconfigurations, like mismatched checkpoint intervals which could break us, but some of the new stuff like 'outstandingrequests' being out of sync might also break things in subtle, not so easy to spot ways


kostas
2016-06-28 15:29
I'm having the same concerns, which is why I resurfaced this during the weekend.

kostas
2016-06-28 15:34
Step 4 is still not as clear to me as I'd like it to be. I can make assumptions but I'm not sure about their validity. One way of interpreting it: the chain creator creates the genesis block (`$make genesis` or whatever). How is this distributed to all nodes? An out-of-band process ("make sure you have this file in this directory before you bring up your node"), or a process that happens during hand-shaking? Simon's last comment makes me think we're going for the former.

simon
2016-06-28 15:36
yes, of course

jyellick
2016-06-28 15:37
Bootstrapping is notoriously hard to do securely and automatically. Even if we say "make sure the config is the same when you first start so you get matching genesis blocks" and then verify it on startup, it's effectively the same as distributing the genesis block.

jyellick
2016-06-28 15:50
@simon @vukolic @cca @kostas My reading of the paper indicates that a byzantine client could broadcast a request to only f+1 backups, and force a view change. It could continue to do this indefinitely, forcing the network to constantly change views. This seems like a problem to me? Is this something that has been addressed before?

cca
2016-06-28 15:56
@jyellick: referring to the PBFT paper? wouldn't the client in our case send the request to the leader first anyway, and the leader inserts it into the system?

jyellick
2016-06-28 15:57
@cca I'm proposing a byzantine client, who is trying to slow the throughput of the system (And yes, to the Castro paper)

cca
2016-06-28 15:58
but, isnt the client here supposed to send it to the leader?

kostas
2016-06-28 16:00
The point is that a client can follow this tactic to overthrow a normally-functioning leader, correct?

jyellick
2016-06-28 16:01
Yes, maybe I am missing something here, but the client is byzantine, and intentionally not following the protocol. So instead of broadcasting to all replicas, it is only broadcasting to f+1, which it knows are *not* the leader.

cca
2016-06-28 16:01
certainly possible, sounds like things discussed in the Aardvark paper and "Byzantine replication under attack" by Yair Amir et al

jyellick
2016-06-28 16:03
My concern is that, in the fabric scheme, the replicas themselves act as clients. So, a single byzantine replica could essentially force a view change until that replica becomes the primary. With f conspiring together, it seems that they could force the network to usually be led by a byzantine replica.

cca
2016-06-28 16:05
yes - read these 2 papers (aardvark = Making Byzantine Fault Tolerant Systems Tolerate Byzantine Faults)

cca
2016-06-28 16:05
given the FLP impossibility in the asynchronous model, with enough asynchrony the system never gets anything useful done. only randomized protocols can get around this

jyellick
2016-06-28 16:09
I will try to give those papers a read. Would you suggest that we simply ignore this sort of attack for the time being, or are there some proactive steps we can take now in development to make adapting to these easier in the future?

cca
2016-06-28 16:09
ignore that for now

cca
2016-06-28 16:10
the quick remedies are pretty simple, i recall, see papers

vukolic
2016-06-28 17:08
@jyellick

vukolic
2016-06-28 17:08
not sure what PBFT is exactly doing - how client retransmission should be done is as follows

vukolic
2016-06-28 17:09
client is supposed to resend to f+1 or more replicas

vukolic
2016-06-28 17:09
who then forward the request to the primary

vukolic
2016-06-28 17:09
and only then replicas fire the timer

vukolic
2016-06-28 17:09
which prevents the scenario you are describing above

vukolic
2016-06-28 17:09
now this may be departing from PBFT - and in this case yes this is a bug in the paper


vukolic
2016-06-28 17:11
(I just saw that @cca already pointed to that paper)

jyellick
2016-06-28 17:15
@vukolic I've been discussing with @kostas, and am becoming increasingly convinced, that unless requests are signed (so that their origin cannot be forged), many of these problems are not solvable. In particular, in order to avoid executing requests multiple times, we must filter out requests we receive which are older than our 'last executed time' for that particular client. This is because by the time we receive a request from a client, in an asynchronous network, the network may have already executed that request. If a malicious client/replica can forge a request from far in the future for a client, it may effectively censor that client indefinitely.
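The filtering (and the censorship attack on it) described above can be sketched as follows. This is a minimal illustration with hypothetical names, not the actual fabric code:

```go
package main

import "fmt"

// replica tracks, per client ID, the timestamp of the newest request
// it has executed for that client (hypothetical structure).
type replica struct {
	lastExecuted map[uint64]int64
}

// accept returns true if the request is newer than anything we have
// executed for this client; older (or equal) requests are dropped as
// stale/duplicates. Without signed requests, a forged far-future
// timestamp makes every later genuine request look stale -- the
// indefinite-censorship problem discussed above.
func (r *replica) accept(clientID uint64, ts int64) bool {
	if ts <= r.lastExecuted[clientID] {
		return false
	}
	r.lastExecuted[clientID] = ts
	return true
}

func main() {
	r := &replica{lastExecuted: map[uint64]int64{}}
	fmt.Println(r.accept(1, 100))   // genuine request: accepted
	fmt.Println(r.accept(1, 90))    // older: dropped as stale
	fmt.Println(r.accept(1, 1<<40)) // forged far-future timestamp accepted...
	fmt.Println(r.accept(1, 101))   // ...and now genuine requests are censored
}
```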

vukolic
2016-06-28 17:16
in HL fabric theory - client signs all requests

vukolic
2016-06-28 17:16
however this is now done, I believe, only if "security is turned on"

vukolic
2016-06-28 17:16
which painfully slows things down...

vukolic
2016-06-28 17:16
so we need to revisit that going towards v2

jyellick
2016-06-28 17:17
So I think the problem is that a HL fabric 'request' is a transaction, but PBFT operates on `Request` messages, which contain a transaction, plus some other information like originating replica.

jyellick
2016-06-28 17:19
So, the actual `Request` structure is never signed, and consequently, any replica can forge a `Request` as originating from another replica in a scenario such as the "forwarding to the primary" one you described

vukolic
2016-06-28 17:19
the short answer is - PBFT request must be signed

vukolic
2016-06-28 17:19
so we agree

vukolic
2016-06-28 17:19
let's fix this in v2

jyellick
2016-06-28 17:20
Do you think we could instead use a scheme of REQUEST-ACK such as the VIEW-CHANGE-ACK described in the Castro paper?


vukolic
2016-06-28 17:21
it describes other attacks which are possible if you do not have client's signature (cf. BigMac attack)

vukolic
2016-06-28 17:22
now I do not know what you refer to with REQ-ACK - but if this is too chatty, as in all-to-all communication - I would not do it - a signature is cleaner and better and simpler

kostas
2016-06-28 17:22
Signed requests in v2 will align neatly with the signed view-change messages.

vukolic
2016-06-28 17:22
I think we have those @simon

jyellick
2016-06-28 17:23
We already do signatures, yes

vukolic
2016-06-28 17:23
because this is our departure from PBFT

vukolic
2016-06-28 17:23
as unsigned view change msgs are merely an academic showcase

jyellick
2016-06-28 17:23
In the Castro paper, rather than sign view change messages, replicas reply to a VIEW-CHANGE with a VIEW-CHANGE-ACK, and a view change message is only considered to be valid after it has a quorum cert of VIEW-CHANGE-ACKs (or something similar)

kostas
2016-06-28 17:23
Right, right. I just sent Jason your slides on this yesterday.

vukolic
2016-06-28 17:23
of the fact that it is possible to have a signature-free protocol

vukolic
2016-06-28 17:23
which was interesting at that time, for, well, academic reasons...

vukolic
2016-06-28 17:23
so we dropped that early on

jyellick
2016-06-28 17:24
Right, I wasn't sure whether it was a coding optimization, or a performance one.

vukolic
2016-06-28 17:24
I felt it is both :slightly_smiling_face:

jyellick
2016-06-28 17:24
As certainly the implementation without signatures is more complicated.

vukolic
2016-06-28 17:25
optimizing for the uncommon case rarely brings benefits

vukolic
2016-06-28 17:25
and if it complicates the code - for me it was a no go

jyellick
2016-06-28 17:25
Right.

jyellick
2016-06-28 17:25
This is why I was curious if such a scheme would be good for requests, as it is the common path.

vukolic
2016-06-28 17:25
it is if client's signature is expensive for you

vukolic
2016-06-28 17:26
to generate and verify

vukolic
2016-06-28 17:26
for generation we do not care - client does it

vukolic
2016-06-28 17:26
for verification - IMO it is much faster/more scalable than all-to-all chat

jyellick
2016-06-28 17:26
Ah, but the problem here is that the only clients are validating peers

jyellick
2016-06-28 17:27
(ie replicas)

vukolic
2016-06-28 17:27
yes - so they will have a bit more latency with a signature - but then again a decent signature is milliseconds

vukolic
2016-06-28 17:27
all to all among consenters might be 100s of ms

vukolic
2016-06-28 17:28
+ there is a BigMac issue described in that paper up there

vukolic
2016-06-28 17:28
so conclusion of that paper - if you want stable performance - do have clients' signatures

jyellick
2016-06-28 17:29
Maybe for v2 we should be looking at having the client and consenter be truly different entities?

vukolic
2016-06-28 17:29
this is the case, no?

vukolic
2016-06-28 17:29
client = peer

vukolic
2016-06-28 17:29
consenter = consenter

jyellick
2016-06-28 17:30
But today, the peer submits a fabric transaction, which is not a PBFT request. Then, the PBFT replica receives that transaction, wraps it in a PBFT request, and submits it to the network, acting as a client.

vukolic
2016-06-28 17:31
my intuition is that this is ok so long as one verifies fabric tx signature

vukolic
2016-06-28 17:31
(not sure we do that though)

vukolic
2016-06-28 17:32
this should be the TCert signature

vukolic
2016-06-28 17:32
the question is do we verify it in PBFT or not?

jyellick
2016-06-28 17:33
I do not believe that we do.

jyellick
2016-06-28 17:35
```
message Transaction {
    enum Type {
        UNDEFINED = 0;
        // deploy a chaincode to the network and call `Init` function
        CHAINCODE_DEPLOY = 1;
        // call a chaincode `Invoke` function as a transaction
        CHAINCODE_INVOKE = 2;
        // call a chaincode `query` function
        CHAINCODE_QUERY = 3;
        // terminate a chaincode; not implemented yet
        CHAINCODE_TERMINATE = 4;
    }
    Type type = 1;
    // store ChaincodeID as bytes so its encrypted value can be stored
    bytes chaincodeID = 2;
    bytes payload = 3;
    bytes metadata = 4;
    string uuid = 5;
    google.protobuf.Timestamp timestamp = 6;
    ConfidentialityLevel confidentialityLevel = 7;
    string confidentialityProtocolVersion = 8;
    bytes nonce = 9;
    bytes toValidators = 10;
    bytes cert = 11;
    bytes signature = 12;
}
```
This is the transaction definition

jyellick
2016-06-28 17:35
```
message request {
    // Generated at the client level. Ensures that client's requests are atomically ordered.
    google.protobuf.Timestamp timestamp = 1;
    bytes payload = 2; // opaque payload
    uint64 replica_id = 3;
    bytes signature = 4;
}
```

jyellick
2016-06-28 17:35
This is the request definition.

vukolic
2016-06-28 17:36
ah so there are two signatures?

jyellick
2016-06-28 17:36
Well, it looks that way, but today, we do nothing to populate or validate the `Request` signature

kostas
2016-06-28 17:36
One for the submitter, one for the validator.

jyellick
2016-06-28 17:37
We could start doing this, but then we will be having two signatures, which seems suboptimal

vukolic
2016-06-28 17:37
so request.payload is Transaction (from above)

jyellick
2016-06-28 17:38
Correct, marshaled via protobuf to a byte slice.

kostas
2016-06-28 17:38
But if the plan is to have the validator assign it a slot number, he needs to sign on that assignment. So you definitely need a validator signature.

vukolic
2016-06-28 17:38
then we could verify Transaction.signature

vukolic
2016-06-28 17:38
aha - what is slot?

vukolic
2016-06-28 17:38
@kostas

jyellick
2016-06-28 17:38
@vukolic: This is a new concept

vukolic
2016-06-28 17:39
creative :slightly_smiling_face:

jyellick
2016-06-28 17:39
You can think of them as 'virtual clients'

jyellick
2016-06-28 17:39
The problem we had was, per the PBFT paper, a client should only have one request in flight at a time

jyellick
2016-06-28 17:39
It should wait until the request is fulfilled before submitting another one

jyellick
2016-06-28 17:40
With multiple requests in flight, the primary can pick the later request to order first, and the network will believe that the second request is stale (because its time stamp is older), and effectively censor this second request.

jyellick
2016-06-28 17:40
We suffer from this censorship problem today.

jyellick
2016-06-28 17:40
The idea was to assign each submitted request to one of a finite number of 'slots', or 'virtual client ids' (I like 'slot', as it's shorter)

jyellick
2016-06-28 17:41
So that you could have as many outstanding requests, as you had slots, and still solve the censorship problem.

vukolic
2016-06-28 17:41
how about using not UTC timestamps but a counter

vukolic
2016-06-28 17:41
so primary would not be able to do this

kostas
2016-06-28 17:41
(For more context on slots, they were first brought up here I think: https://hyperledgerproject.slack.com/archives/fabric-consensus-dev/p1466604480000530)

vukolic
2016-06-28 17:42
(because holes could be detected)

jyellick
2016-06-28 17:42
I briefly considered a counter, I think the problem is that a client may not always be sure if its request was processed.

jyellick
2016-06-28 17:43
Imagine the replica crashes, or is too slow.

jyellick
2016-06-28 17:43
Then the replica would be unsure of what its counter's true value should be.

jyellick
2016-06-28 17:43
(Because the network may have executed his requests, or not, it has no way to know)

vukolic
2016-06-28 17:43
hmmm...

vukolic
2016-06-28 17:43
don't we implement consensus :slightly_smiling_face:

vukolic
2016-06-28 17:44
anyway seems we are going in the direction of including FIFO into consensus total order

vukolic
2016-06-28 17:45
there is precedent for this - a very practical system actually doing this

vukolic
2016-06-28 17:45
(albeit in crash)

vukolic
2016-06-28 17:45
it is Zookeeper


vukolic
2016-06-28 17:46
that said - let me have a look at slots

vukolic
2016-06-28 17:49
ok so what I could quickly grasp is - slots can be seen to be a sort of a moving window on a counter?

vukolic
2016-06-28 17:49
does that interpretation make sense?

jyellick
2016-06-28 17:50
I would say, slots are analogous to virtual clients. Each slot holds a request until it is executed. In this way, you can have as many requests in flight, as you have slots, without risking duplication or censorship.

jyellick
2016-06-28 17:51
When the replica, acting as a client, generates a request, it picks a free slot to assign that request to. In this way, it simulates that it had received this request from that virtual client.
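The slot idea described above can be sketched as a tiny allocator. This is purely illustrative (hypothetical names, not the fabric implementation):

```go
package main

import "fmt"

// slots models the "virtual client" idea: a replica may have at most
// one request in flight per slot, so the number of slots bounds the
// number of outstanding requests while preserving the one-request-
// per-client rule from the PBFT paper.
type slots struct {
	inFlight []bool // inFlight[i] is true while slot i holds an unexecuted request
}

// assign picks a free slot for a new request, returning -1 when all
// slots are busy (the replica must then hold the request back).
func (s *slots) assign() int {
	for i, busy := range s.inFlight {
		if !busy {
			s.inFlight[i] = true
			return i
		}
	}
	return -1
}

// release frees a slot once its request has been executed.
func (s *slots) release(i int) { s.inFlight[i] = false }

func main() {
	s := &slots{inFlight: make([]bool, 2)}
	fmt.Println(s.assign(), s.assign(), s.assign()) // 0 1 -1: both slots busy
	s.release(0)
	fmt.Println(s.assign()) // 0: slot freed after execution, reusable
}
```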

vukolic
2016-06-28 17:51
so my take on this is - stick with 1 request at a time for v0.5

vukolic
2016-06-28 17:51
and lets add sth like slots w. FIFO for v2

jyellick
2016-06-28 17:51
Absolutely, this is all post 0.5

vukolic
2016-06-28 17:51
but for v2 we do need also signatures

vukolic
2016-06-28 17:52
clients' signatures

kostas
2016-06-28 17:52
(All the work from now on is post v0.5 by the way.)

vukolic
2016-06-28 17:52
allowing multiple outstanding requests (w. FIFO) is a noble idea

vukolic
2016-06-28 17:52
and should be implemented :slightly_smiling_face:

jyellick
2016-06-28 17:53
Yes, I think we can definitely preserve the FIFO nature with slots. I will think on switching to a counter over a timestamp, as although I think it may have some problems, it may have fewer than the timestamp.

vukolic
2016-06-28 17:53
BTW, guys do have a look at ZAB paper and making BFT systems tolerate byzantine faults papers

vukolic
2016-06-28 17:54
both are good stuff

jamie.steiner
2016-06-28 17:54
is there an interface specification for implementing a new/different consensus algorithm?

vukolic
2016-06-28 17:54
@kostas is the expert on specs :slightly_smiling_face:

vukolic
2016-06-28 17:55
short answer is - yes

jyellick
2016-06-28 17:55
@jamie.steiner: There is a protocol spec, but I think it is unfortunately a little outdated

jamie.steiner
2016-06-28 17:55
:point_left: :point_right:

jamie.steiner
2016-06-28 17:55
:slightly_smiling_face:

kostas
2016-06-28 17:55
Look into the spec, section 3.4


kostas
2016-06-28 17:55
But as Marko said it is outdated.

vukolic
2016-06-28 17:55
for the outdated hint

vukolic
2016-06-28 17:56
for an outlook in how the interface will look like - down the road have a look here https://github.com/hyperledger/fabric/wiki/Next-Consensus-Architecture-Proposal

vukolic
2016-06-28 17:57
so we have an outlook and dated specs - but not the current one :slightly_smiling_face:

jamie.steiner
2016-06-28 17:57

vukolic
2016-06-28 17:58
that one is still on my "to read" list...

kostas
2016-06-28 18:00
I am still trying to wrap my head around the "Attacking PBFT" section in Miller's paper, but that's another story I guess.

vukolic
2016-06-28 18:00
in the meantime - informed, brief reviews welcome here

vukolic
2016-06-28 18:01
or perhaps we can open a #reading-section channel :slightly_smiling_face:

davidjhowie
2016-06-28 22:12
has joined #fabric-consensus-dev

simon
2016-06-29 13:02
hi guys

kostas
2016-06-29 13:03
hi

simon
2016-06-29 13:17
so what's next on the agenda?

simon
2016-06-29 13:17
i played around with performance measurement without chaincode

kostas
2016-06-29 13:17
I saw the numbers, 4K tps

simon
2016-06-29 13:17
yep

kostas
2016-06-29 13:19
So I'm working on unifying obc-batch and PBFT piece by piece. Working on this with Jason to make sure I don't break his work.

tuand
2016-06-29 13:20
per binh, we also need to come up with list of v2 work items and add them to https://github.com/hyperledger/fabric/wiki/Fabric-Next

simon
2016-06-29 13:21
for our stuff or for everybody?

simon
2016-06-29 13:21
oh this is super high level stuff

tuand
2016-06-29 13:22
for consensus specifically but nothing that says we can't add other requirements

tuand
2016-06-29 13:25
@vukolic, @cca need your input as well ^^^

vukolic
2016-06-29 13:26
1) come up with a centralized consensus service implementation (single consenter)

vukolic
2016-06-29 13:26
from that point on we split peer development and consensus development in parallel

vukolic
2016-06-29 13:27
2) work on peer implementation per design document

vukolic
2016-06-29 13:27
3) work on extracting PBFT from the current codebase as the consensus service

vukolic
2016-06-29 13:27
2 and 3 are in parallel

vukolic
2016-06-29 13:28
let me know if I should suggest breaking 2 and 3 into more details or is this sufficient

tuand
2016-06-29 13:28
more details always welcome :slightly_smiling_face: will help with prioritization

vukolic
2016-06-29 13:29
for 2)

vukolic
2016-06-29 13:30
2a) implement basic peer level protocols (Sections 1-5). Haifa Research Lab will also work in parallel on peer-to-peer communication facilities; we need to see how we integrate with them

tuand
2016-06-29 13:30
also @garisingh has requirements for v2 implementation ?

vukolic
2016-06-29 13:30
2b) add peer reconfiguration

vukolic
2016-06-29 13:30
2c) add support for confidential chaincodes

vukolic
2016-06-29 13:31
3a) start by extracting stuff from current codebase

vukolic
2016-06-29 13:32
3b) perhaps plug-in apache kafka as a pilot distributed consensus service (although crash-tolerant)

simon
2016-06-29 13:32
are we concerned about confidential chaincodes?

vukolic
2016-06-29 13:32
3c) go to more elaborate protocols instead of PBFT

cbf
2016-06-29 13:32
+1 I would like to see us avoid rolling our own messaging

simon
2016-06-29 13:32
so who is implementing the submitting peer and endorser part?

cbf
2016-06-29 13:33
kafka, 0mq, NATS

vukolic
2016-06-29 13:33
@simon yes we are - we will have some fabric level support for them - we need to support confidential stuff

cbf
2016-06-29 13:33
there are a number of viable alternatives that would be far more robust than anything we might write

vukolic
2016-06-29 13:34
@cbf absolutely, the problem is none of these is byzantine fault tolerant so none of these is really ideal for HL fabric

vukolic
2016-06-29 13:34
so we will have to be building in parallel our own stuff

vukolic
2016-06-29 13:34
that is hopefully going to become the standard (state-of-the-art) as much as these other things mentioned above are (in the crash-recovery world)

cbf
2016-06-29 13:34
and the messaging itself needs to be BFT?

cbf
2016-06-29 13:34
because why?

vukolic
2016-06-29 13:35
no it does not

cbf
2016-06-29 13:35
exactly

vukolic
2016-06-29 13:35
we were talking on diff level of abstractions - sorry

vukolic
2016-06-29 13:35
when I said Kafka, I really meant kafka as the consensus service

vukolic
2016-06-29 13:35
because the API is really similar

vukolic
2016-06-29 13:35
except for the hashchain which should not be difficult to add

simon
2016-06-29 13:36
i wonder who will implement the submitting peer

simon
2016-06-29 13:37
and the endorsers

vukolic
2016-06-29 13:37
this we need to decide... How do we split current dev set and how do we get it reinforced

vukolic
2016-06-29 13:38
I presume we start from dev preferences?

simon
2016-06-29 13:38
well we have a lot of work just working on pbft

tuand
2016-06-29 13:38
no one signed up and/or assigned to anything yet ... at this point, get the component designs in sync and list items for v2

vukolic
2016-06-29 13:39
for peer level stuff I expect significant contributions from Haifa folks

vukolic
2016-06-29 13:39
but it would be great we get community involvement

vukolic
2016-06-29 13:39
I am trying to get BFT-smart folks on board HL fabric as well

vukolic
2016-06-29 13:39
re Haifaalthough

vukolic
2016-06-29 13:40
sorry, re Haifa: they want to focus more on communication

vukolic
2016-06-29 13:40
and for submitter endorser perhaps the logical choice is to ask ledger/execution folks from v0.5 to help

simon
2016-06-29 13:42
endorser interface is just chaincode + network

vukolic
2016-06-29 13:42
true... + peer's interface to ledger

vukolic
2016-06-29 13:43
hence it is chaincode (execution) + network + ledger

vukolic
2016-06-29 13:43
ledger seems critical there with all the readset/writeset/stateUpdate stuff

vukolic
2016-06-29 13:44
for network we probably should have Jeff and other folks responsible for messaging now working with Haifa

nits7sid
2016-06-29 14:38
Hi... I am using batch PBFT with 4 peers and a CA. I was doing a performance analysis with the batch size set to 100. I fired 100 transactions in a loop and noticed that the block got generated in 6s. Then I fired a single transaction and it took 15s to commit the block. So how are the batch size and the time to commit a block related?

tuand
2016-06-29 14:40
they are not ... batchsize is used to batch incoming requests before sending them through consensus. how long the chaincode takes to execute is another matter

nits7sid
2016-06-29 14:54
Ohh and what is the batch timeout exactly?

tuand
2016-06-29 14:56
when the timeout fires, the batch of requests is processed even if we have not yet received batchsize requests

nits7sid
2016-06-29 14:59
ohh

nits7sid
2016-06-29 14:59
thanks @tuand

tuand
2016-06-29 15:01
np ... check out PR #2003 and the # channel

iko
2016-06-29 19:43
has joined #fabric-consensus-dev

jyellick
2016-06-30 13:36
@simon: Are you around?

simon
2016-06-30 13:37
i am!

jyellick
2016-06-30 13:38
You haven't spoken up on https://github.com/hyperledger/fabric/issues/2053 yet. I'm battling back and forth in my head over which answer I think is best, was wondering your thoughts on TCP vs UDP

simon
2016-06-30 13:41
well, i don't think we should go to udp

simon
2016-06-30 13:41
there doesn't seem to be any reason

simon
2016-06-30 13:42
if we have an obvious optimization by relying on (typically) sequential delivery, we should go for it

jyellick
2016-06-30 13:44
Okay. I think that's certainly the easiest path

jyellick
2016-06-30 13:45
It seems to me, like ultimately, we could have a faster system if we built it on UDP, and clearly PBFT was designed for a UDP network. We've already implemented a lot of the things like windowing which would be needed for UDP to work.

jyellick
2016-06-30 13:46
But, there's certainly something to be said for making our lives easier, and we're a long way from TCP being our bottleneck at this point.

jyellick
2016-06-30 13:53
@simon Care to voice your opinion in 2053?

simon
2016-06-30 14:04
i don't even know what we are discussing

jyellick
2016-06-30 14:06
Basically "Should we try to keep the code tolerant of non sequential delivery?"

simon
2016-06-30 14:08
well, do we have anything actionable?

garisingh
2016-06-30 14:08
are we trying to talk about the "new" mysterious code for the mythical 2.0 architecture?

jyellick
2016-06-30 14:11
@garisingh: I was trying to dispose of some technical debt we've built up, namely some byzantine scenarios which we're vulnerable to. The implementation is much simpler if we can rely on FIFO links, but the PBFT paper does not assume them.

simon
2016-06-30 14:12
for example?

kostas
2016-06-30 14:13
"If the next pre-prepare that you receive from the primary doesn't correspond to seqNo 11 (i.e. it creates a hole), you should view-change."

jyellick
2016-06-30 14:13
The question is, should we try to keep things in a state where, if we so chose, we could switch to a non-FIFO-link network protocol (like UDP), or embrace the fact that we're running over TCP/gRPC and take the optimizations.

simon
2016-06-30 14:13
i think it is much harder to reason about changes

jyellick
2016-06-30 14:14
Is there anything in the code today that relies on FIFO ordering? Nothing jumped to mind for me

garisingh
2016-06-30 14:14
in either case, you need to check for missing sequence numbers, so you are always tracking the last seqNo you received / processed. As far as I know, gRPC does not give you access to a seqNo, and the gRPC link could break

jyellick
2016-06-30 14:14
This is a PBFT seqNo?

garisingh
2016-06-30 14:15
right - my point is you still need to track the last one received no matter what even on a FIFO link

garisingh
2016-06-30 14:15
maybe you don't get them out of order, but you could miss them

jyellick
2016-06-30 14:15
Certainly. But, if you are missing seqNo 10, and you see seqNo 11, then you know you missed something, with ordering.
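This gap check over a FIFO link amounts to tracking one counter per peer. A minimal sketch with illustrative names:

```go
package main

import "fmt"

// fifoLink tracks the last PBFT seqNo seen from one peer. Over a FIFO
// link, the first gap proves a message was lost (or the sender is
// misbehaving), so the receiver can react immediately -- e.g. kick off
// recovery or a view change -- instead of waiting for a timer.
type fifoLink struct {
	lastSeqNo uint64
}

// receive returns true when seqNo is the expected next one; false
// signals a hole that demands recovery.
func (l *fifoLink) receive(seqNo uint64) bool {
	if seqNo != l.lastSeqNo+1 {
		return false
	}
	l.lastSeqNo = seqNo
	return true
}

func main() {
	l := &fifoLink{}
	fmt.Println(l.receive(1), l.receive(2)) // in-order delivery
	fmt.Println(l.receive(4))               // hole at seqNo 3: recover now
}
```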

garisingh
2016-06-30 14:16
but the seqNo is in the PBFT protocol message anyway, correct? I am most likely missing something here

simon
2016-06-30 14:17
jyellick: but you also may have been disconnected and reconnected?

jyellick
2016-06-30 14:17
Yes, the seqNo is part of the PBFT protocol message, and PBFT assumes that these sequence numbers will arrive in any order, and so long as they are within a sliding window, we are able to process them.

kostas
2016-06-30 14:17
Correct, but he'll know he missed something and, maybe kick in state transfer, instead of thinking that this is normal due to messages arriving out of order.

garisingh
2016-06-30 14:17
you can't rely on gRPC for sequence numbers - you can only rely on the fact that if a stream stays connected messages sent over the stream will be delivered in order (no concurrent access)

kostas
2016-06-30 14:18
Ultimately, a FIFO link allows you to detect that things are abnormal faster.

garisingh
2016-06-30 14:18
but if the link breaks, you have no idea what you missed

garisingh
2016-06-30 14:18
@kostas - I think I agree with that

garisingh
2016-06-30 14:18
and you may not need to try to sort - just track the last protocol seqNo received

jyellick
2016-06-30 14:18
But you can detect that the link broke, and take appropriate action. And when the link is re-established, we know the first message we get, is the lowest seqNo we will receive from that link

simon
2016-06-30 14:19
yes

garisingh
2016-06-30 14:19
ah - gotcha - and then you can "request" the missing seqNos since you know you won't get them over the re-established link

simon
2016-06-30 14:19
but would that simplify code?

garisingh
2016-06-30 14:21
do we wait around today to see if out of order messages come in before requesting them? Or do we not request them directly at all?

jyellick
2016-06-30 14:29
We wait around for them. There's no checking for ordering in the code today. Basically, if there are enough healthy nodes, the network moves on, and eventually we notice, that enough people are talking about sequence numbers outside of our moving window, that we must be behind, and then do some recovery stuff. If there aren't enough healthy nodes to keep the network going, we do a view change, which basically picks a new starting sequence number and state that everyone can agree to work forward from.
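The "moving window" mentioned here is the classic PBFT watermark check; a sequence number is only processed while it sits between the low watermark h (last stable checkpoint) and h+L. A minimal sketch:

```go
package main

import "fmt"

// inWatermarks reports whether seqNo n falls inside the sliding
// window (h, h+L]. Seeing enough peers talk about seqNos beyond the
// window is the hint that this replica has fallen behind and should
// start recovery.
func inWatermarks(h, L, n uint64) bool {
	return n > h && n <= h+L
}

func main() {
	const h, L = 100, 50
	fmt.Println(inWatermarks(h, L, 120)) // inside the window: process normally
	fmt.Println(inWatermarks(h, L, 200)) // beyond it: we must be behind
}
```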

garisingh
2016-06-30 14:30
well we can improve that :wink:

jyellick
2016-06-30 14:32
The piece of code I was working on, was to solve a censorship problem. From the PBFT paper, clients are supposed to only have one request in flight, in this way, each client gets a request timer, and if the request is censored, then a view change happens. For us, each VP is a client, so, having only 1 request in flight is a non-starter, so, we needed to basically pretend each VP is multiple clients. In the unordered case, there's some complexity, where we can't necessarily expect requests we send to the primary for ordering to be ordered in the same order we sent them. It's doable, but trickier than assuming that if the primary sends out of order requests, or skips one, that we immediately know there's a problem, rather than having to wait for the timer to pop.

garisingh
2016-06-30 14:33
gotcha - this makes sense to me - `but trickier than assuming that if the primary sends out of order requests, or skips one, that we immediately know there's a problem, rather than having to wait for the timer to pop.`

garisingh
2016-06-30 14:34
that's what I was getting at, and that should be true for gRPC assuming you send in the correct order :wink:

garisingh
2016-06-30 14:35
meaning your inputs from the primary to the stream must be in order to start with :wink:

garisingh
2016-06-30 14:35
this is how pub/sub protocols generally work as well FWIW

jyellick
2016-06-30 14:37
Exactly. Simplifying our lives not only makes the code easier, but probably reduces bugs, may improve execution times, etc. So, if we've got FIFO links, I think we should definitely use them. My question was, talking with some of the distributed guys, I've heard opinions like "to ever seriously scale, you must be able to do UDP". I think we could switch to UDP today, without too much headache, but I think we need to commit one way or the other.

garisingh
2016-06-30 14:39
Not sure you have to go to UDP. UDP is probably better if you are going to "broadcast" from a single peer to tons of peers - but you can also go with more of a mesh network as well where not all peers are connected to each other directly

garisingh
2016-06-30 14:40
If we move to more of a "broadcast" model, you could follow the model where if you see a message for the first time you re-broadcast it

jyellick
2016-06-30 14:40
PBFT is certainly broadcast heavy, there is no normal path unicasting that I can think of.

garisingh
2016-06-30 14:42
I think we need to look at how we really do "atomic broadcast" if we are going that route for 2.0 and where something like PBFT fits in (if anywhere). Clearly if you have a centralize consensus service you could use PBFT among the consenter processes, but I would assume we would move to a broadcast model for deliver to committers / followers

garisingh
2016-06-30 14:43
That's what I'd like to see laid out for this 2.0 piece. We should be completely agnostic of blockchain, MVCC, etc. We need to prove out a simple "centralized" broadcast service for "log" replication without caring about what's in the log message, nor what the state machine which processes the "log" on the committer does

jyellick
2016-06-30 14:44
+1 to that. It terrifies me when I see people making comments about how "consensus should filter out bad transactions"

jyellick
2016-06-30 14:45
Consensus should do log replication, so that everyone gets the same order of, whatever it is, we don't care.

simon
2016-06-30 14:45
well, you'd like there to be admission control

garisingh
2016-06-30 14:45
I would start simple - build a simple broadcast service and I personally would probably use something like Raft between the consenters and broadcast between the consenter and the committers

garisingh
2016-06-30 14:45
then build on top of that

simon
2016-06-30 14:45
well go ahead and implement raft

simon
2016-06-30 14:45
because we thought pbft was simple, and it is really difficult

garisingh
2016-06-30 14:46
@simon - actually I would just take the etcd/raft implementation. I am not saying it would be the final thing, just a simple way for a dummy like me to take code that exists to prove out a consenter service

garisingh
2016-06-30 14:47
we could actually use the PBFT implementation today in place of Raft I just have no idea how to extract it from the fabric code :wink:


garisingh
2016-06-30 14:49
But in the end, all I am really saying is to start with the basic "log" replication stuff using a consenter service rather than trying to do everything at once. Incremental build out rather than extract and retrofit

jyellick
2016-06-30 14:50
So, going way back to my original question, should we assume FIFO links going forward or not? The unfortunate answer it sounds to me like, is "We don't know what this future thing looks like, so let's keep our options open"

garisingh
2016-06-30 14:52
the question is: are you trying to fix the existing implementation in hopes that we use it in the future, or just trying to fix / simplify the existing implementation?

jyellick
2016-06-30 14:52
The former (fix it, under the hope that it is used in the future)

jyellick
2016-06-30 14:53
Frankly, if we don't think (there's at least a decent chance) we'll be using the existing implementation in the future, why bother doing any work on it at all?

garisingh
2016-06-30 14:54
agreed. if we can give you guys a chance to come up for air, we need to have some discussions about what 2.0 architecture really looks like

jyellick
2016-06-30 14:58
Well, we are heads down on master, I don't think anyone is on the 0.5 branch right now. So, if we can't say whether we see a future for the consensus code in master, then I don't see any reason we should not have time for this discussion.

garisingh
2016-06-30 15:00
okay - cool. I'd love to be a part of it, but unfortunately I am more or less out the rest of today and tomorrow. but I'd love to see some more details on the underpinnings of the architecture and discuss an incremental build plan for a modular architecture :wink:

kostas
2016-06-30 19:55
Can we still make the case for this `validate` method, or should I remove it while I'm doing the refactoring? https://github.com/hyperledger/fabric/blob/master/consensus/obcpbft/obc-batch.go#L199

jyellick
2016-06-30 20:06
My vote is kill it. I don't see how this is useful to us

tuand
2016-06-30 20:07
+1

vukolic
2016-06-30 21:05
@garisingh if one wants to try a crash-fault tolerant protocol in fabric v2 in place of a consensus service then one should not use etcd/raft or anything but Kafka

vukolic
2016-06-30 21:05
which needs to be hacked just to output the hashchain instead of a simple total order of requests

vukolic
2016-06-30 21:06
I am talking single-topic single-partition Kafka

vukolic
2016-06-30 21:07
as Kafka already functions as we want consensus service to function (etcd/raft function differently)

vukolic
2016-06-30 21:08
of course we need BFT Kafka - which does not exist, and we can start from making one around our v0.5 PBFT code

vukolic
2016-06-30 21:08
later on we replace this with a more scalable protocol than PBFT
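The hashchain output mentioned above, as opposed to a bare total order, can be sketched like this (illustrative names; sha256 chosen arbitrarily for the example):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// chainEntry is what the consensus service would emit instead of a
// bare ordered message: each entry's hash covers the previous hash,
// so the total order becomes a tamper-evident hash chain.
type chainEntry struct {
	payload  []byte
	prevHash [32]byte
	hash     [32]byte
}

// appendEntry links payload onto the chain ending at prev.
func appendEntry(prev [32]byte, payload []byte) chainEntry {
	h := sha256.Sum256(append(prev[:], payload...))
	return chainEntry{payload: payload, prevHash: prev, hash: h}
}

func main() {
	var genesis [32]byte
	e1 := appendEntry(genesis, []byte("batch-1"))
	e2 := appendEntry(e1.hash, []byte("batch-2"))
	// A consumer can verify the link without trusting the orderer:
	fmt.Println(e2.prevHash == e1.hash)
}
```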

garisingh
2016-07-01 10:02
@vukolic: Agreed on Kafka and it would not be hard to output the hash chain using Kafka. My point on etcd/raft was slightly different though. Kafka uses ZK for some pieces of coordination (although they have lots of other stuff built into the brokers as well) and unfortunately it has proven hard to completely eliminate the tie to ZK with Kafka. My (potentially misguided) mention of Raft assumed that we built our own "broadcast" piece (i.e. our own Kafka) and that we would still need a mechanism to coordinate those processes (which is where I would use the Raft library to start with)

garisingh
2016-07-01 10:03
I don't believe Raft is the answer, but if I can take existing libraries and piece them together to prove out a basic concept, that is the approach I generally take.

lhaskins
2016-07-01 17:57
has joined #fabric-consensus-dev

sunsay00
2016-07-01 21:19
has joined #fabric-consensus-dev

vipinb
2016-07-01 23:17
@vukolic: @simon should wait for Honey badger open source

cca
2016-07-02 06:08
HoneyBadger code is on github, just in a different branch - https://github.com/amiller/HoneyBadgerBFT/tree/another-dev

simon
2016-07-04 14:10
so with consensus v2, where do we store our consensus config, i.e. replica list, etc.

simon
2016-07-04 14:10
do we maintain our own mini-ledger?

simon
2016-07-04 14:10
do we persist into a database?

simon
2016-07-04 14:10
i'd persist into text files

simon
2016-07-04 15:04
jyellick: you around?

jyellick
2016-07-04 18:02
@simon USA holiday today, but can maybe help quickly

thomas.leplus
2016-07-05 00:40
has joined #fabric-consensus-dev

simon
2016-07-05 11:48
hi

jyellick
2016-07-05 13:44
Hey @simon, did you get what you needed yesterday?

simon
2016-07-05 13:45
hi jyellick

simon
2016-07-05 13:45
i forgot what it was about

simon
2016-07-05 13:45
i'm implementing a separate consensus peer binary

jyellick
2016-07-05 13:46
Neat, going to implement the single consenter consensus? Or are you porting PBFT to start?

simon
2016-07-05 13:47
using our obcpbft

simon
2016-07-05 13:47
so much crusty interface

simon
2016-07-05 13:47
so beh

jyellick
2016-07-05 13:51
Yeah, hopefully we can clean that interface, Kostas was working on switching the PrePrepare to carry a RequestBlock instead of a Request, once he finishes this, I think we'll be able to finally merge batch/core and clean much of that up

simon
2016-07-05 14:07
i'm trying to decide where to put the grpc service for consensus

simon
2016-07-05 14:08
does it go into pbft or outside?

simon
2016-07-05 14:08
it feels like it should be part of the consensus implementation

simon
2016-07-05 14:08
i.e. connection management

jyellick
2016-07-05 14:15
I'd assume you're trying to keep it pluggable so it would be relatively easy to drop another implementation in? I would think gRPC should live on the outside

jyellick
2016-07-05 14:16
Though certainly you might need to have some hooks for the connection/disconnection events.

simon
2016-07-05 14:24
i want to avoid excessive marshalling/unmarshalling

simon
2016-07-05 14:24
which currently is the bottleneck

jzhang
2016-07-05 14:46
has joined #fabric-consensus-dev

jyellick
2016-07-05 14:50
Other than the batch marshal/unmarshal, where are we doing that?

simon
2016-07-05 16:40
i don't know exactly how the composition breaks down

simon
2016-07-05 16:41
a lot in hashReq

simon
2016-07-05 16:41
and in txID

jyellick
2016-07-05 17:03
Ah, yes, we do use the marshaling for computing those digests. We could certainly write a custom digest of some sort, though it seems debatable whether it would be more efficient than the marshaling

simon
2016-07-05 17:06
well we should only hash it once

simon
2016-07-05 17:07
jyellick: i'm debating whether the grpc link between peers should receive a stream or send a stream

jyellick
2016-07-05 17:07
Hmmm

jyellick
2016-07-05 17:08
Obviously they could both work, what do you see as the tradeoffs?

simon
2016-07-05 17:08
nothing really

simon
2016-07-05 17:09
i'm trying to figure out how to structure the code

jyellick
2016-07-05 17:12
Presumably this gRPC piece will maintain the point to point links between the peers, and register/deregister connections as consenters come/go? Are you not simply lifting this from the existing peer implementation?

simon
2016-07-05 17:16
oh no

simon
2016-07-05 17:16
absolutely not

simon
2016-07-05 17:16
the current peer code is super messy

jyellick
2016-07-05 17:18
Fair enough. Not to add to your troubles, but I think it would be nice to keep the door open to not use gRPC. Something with true broadcasts might be ultimately what's required for scale.

simon
2016-07-05 17:35
what do you mean by true broadcasts?

simon
2016-07-05 17:44
RAII is messy :confused:


simon
2016-07-05 17:56
@jyellick: it looks quite simple

simon
2016-07-05 17:56
of course i didn't run it yet

simon
2016-07-05 17:56
i need a solution for defining the replica set

jyellick
2016-07-05 18:26
@simon By "true broadcast" I was thinking something like multicast, though I suppose PBFT assumes message validity because the point to point links are secured by an exchanged symmetric key so maybe this is not an obvious drop in

simon
2016-07-05 18:26
wide area multicast?

simon
2016-07-05 18:26
that seems like a lot of pain

jyellick
2016-07-05 18:34
Could be. I'm still not sure how people plan to run this, I've not seen anybody actually try to run things other than same physical location.

jyellick
2016-07-05 18:34
(Not to say that the WAN scenario isn't coming)

simon
2016-07-05 18:42
interesting.

jyellick
2016-07-05 18:46
But even ignoring multicast, apparently some other UDP based transport is still on the table, especially as this was the original design for PBFT.

simon
2016-07-05 19:01
i wouldn't

simon
2016-07-05 19:01
then you need to do DTLS

jyellick
2016-07-05 19:05
I'm not too familiar with DTLS, is it painful?

simon
2016-07-05 19:08
well, i don't see any benefit

simon
2016-07-05 19:08
except for lost packets

garisingh
2016-07-06 11:27
it is really a question of connection versus connectionless when it comes to TCP or UDP (@jyellick - DTLS is basically the analog of TLS for UDP). For WAN-based deployments there is no way to do traditional multicast (since it would be impossible to know what networks / switches to broadcast on)

garisingh
2016-07-06 11:30
@simon - on the gRPC issue, isn't the question really whether or not we continue to use protobufs as the serialization format? Technically we should be able to transport the same protobufs over other transports if we ever wanted to given that gRPC is basically protobufs over HTTP/2 (yes - I know there is an additional serialization for the gRPC stream but in the end what gets delivered to the handlers are protobufs)

simon
2016-07-06 11:30
or we transport other data via gRPC

garisingh
2016-07-06 11:30
well minimally gRPC will require a base protobuf

simon
2016-07-06 11:30
i just saw a lot of memory/GC activity, my suspicion is mostly due to protobufs

simon
2016-07-06 11:31
no, gRPC seems to be able to handle other encoders/decoders

garisingh
2016-07-06 11:33
well that is technically true I guess, but have you ever seen an example of that? and I don't believe that there are (any?) libraries / generators for anything other than the default protobuf implementation

simon
2016-07-06 11:35
just something to keep in mind

garisingh
2016-07-06 11:37
I am not against swapping out the serialization / transport, but in that case we are probably better off just using the raw building blocks (e.g. TCP or HTTP/2) rather than gRPC :wink:

simon
2016-07-06 11:39
i don't know - it does nice things for us

garisingh
2016-07-06 11:46
actually - this is kinda interesting - https://open.dgraph.io/post/rpc-vs-grpc/

simon
2016-07-06 18:15
jyellick: i just pushed a refactored version of the client

simon
2016-07-06 18:15
i think it is getting to a point where it should be able to create a network, but i didn't try that yet

simon
2016-07-06 18:16
basically you need to create a local cert + key, and push the certs of all replicas into the persistence store (`data/config.peers.`)

jyellick
2016-07-06 18:19
Neat, that was quick, I'll try to take a look

simon
2016-07-06 18:48
i spent most time figuring out tls :slightly_smiling_face:

simon
2016-07-07 14:24
yey, getting close

simon
2016-07-07 14:25
panic in sign: not implemented

simon
2016-07-07 14:25
this is where i don't know what to do

simon
2016-07-07 14:25
the crypto layer in fabric is huge and confusing

jyellick
2016-07-07 14:26
Yes, I know there are the assorted certificate types, but it seems like we may have to introduce new ones for the consenter/endorser split?

simon
2016-07-07 14:28
hmm

simon
2016-07-07 14:28
messages not being broadcast?

simon
2016-07-07 14:29
so, we never start our outstanding requests timer when there are outstanding requests

simon
2016-07-07 14:29
but we also don't submit them

simon
2016-07-07 14:30
should we?

simon
2016-07-07 14:32
why is the consensus code still using vp0... etc?

simon
2016-07-07 14:32
instead of actual peerids?

simon
2016-07-07 14:32
what a mess this is

jyellick
2016-07-07 14:36
The outstanding request stuff is definitely pending an overhaul, it is a mess

jyellick
2016-07-07 14:36
I've got some outstanding code to fix this

jyellick
2016-07-07 14:38
But, I'm waiting for @kostas to finish with his work to make the PrePrepare carry a non-opaque RequestBlock instead of a Request, so that we can then merge core/batch

jyellick
2016-07-07 14:39
And I'm not sure what you mean by "vp0..." etc.? I see some of that in our tests, and we need to keep a ReplicaID around, so that we can compute the leader after view change (for which a peerID wouldn't suffice)

simon
2016-07-07 14:43
well the interface sucks

simon
2016-07-07 14:43
we require the peerid to match vp%d

simon
2016-07-07 14:43
which is pointless

jyellick
2016-07-07 14:43
Oh, I thought we killed that

simon
2016-07-07 14:43
yea so did i

jyellick
2016-07-07 14:43
I thought we had behave tests which verify this

simon
2016-07-07 14:44
i wrote code for a whitelist with certificates

simon
2016-07-07 14:44
but it doesn't live inside pbft

simon
2016-07-07 14:44
maybe eventually we can merge those two codebases as well

simon
2016-07-07 14:44
or use a better interface

simon
2016-07-07 14:45
now i need to figure out why exec doesn't finish

simon
2016-07-07 14:46
oh there is still a reference to tx.Uuid

simon
2016-07-07 14:46
how do i post an executedEvent?

simon
2016-07-07 14:47
OH, i shouldn't implement the executor interface directly

simon
2016-07-07 14:48
we really must stop putting all these interfaces into consenter

jyellick
2016-07-07 14:50
Absolutely, that consenter interface is in desperate need of an overhaul

jyellick
2016-07-07 14:51
The whole monolithic interface always bugged me. The thing doing network communication really should not be the thing doing ledger access. It's odd that those are tied together.

simon
2016-07-07 14:59
why does the executor create a state transfer instance?

simon
2016-07-07 14:59
god so messy

simon
2016-07-07 14:59
so much to stub out

simon
2016-07-07 15:05
so what happened right now is that the state transfer gets pulled in

simon
2016-07-07 15:05
and with it all kinds of ledger code, etc.

simon
2016-07-07 15:06
is there a way to decouple executor and state transfer?

jyellick
2016-07-07 15:09
Hmmm

jyellick
2016-07-07 15:10
So, I think we need to ask where state transfer should live.

jyellick
2016-07-07 15:11
In the new design, do we have state transfer for the PBFT log replication, or for the blockchain, or for both?

jyellick
2016-07-07 15:12
Leaving things largely as they stand today, I'd say we simply need to include a state transfer network API in addition to the execution network API

jyellick
2016-07-07 15:13
This also begs the question of how we for instance get the value for our checkpoints? Is this still the blockinfo?

jyellick
2016-07-07 15:13
Really, I think we need to know what the network interface to the endorsers is

simon
2016-07-07 15:17
i think we just do consensus

simon
2016-07-07 15:17
which means we need to do some minimal sort of state transfer for consensus

jyellick
2016-07-07 15:18
So, state transfer is only on the PBFT log then?

simon
2016-07-07 15:18
@vukolic: ideas?

simon
2016-07-07 15:18
do we even maintain a log?

simon
2016-07-07 15:18
i don't think so

jyellick
2016-07-07 15:18
We do not. Though it seems like we must, in order to 'do consensus'

simon
2016-07-07 15:20
why?

simon
2016-07-07 15:20
we just do total order broadcast

simon
2016-07-07 15:21
we really need to merge batch and core

jyellick
2016-07-07 15:21
True. I'm unsure then how we supply a state transfer target if there is a gap in our log

simon
2016-07-07 15:22
i just had a "missing request" issue

simon
2016-07-07 15:22
odd

jyellick
2016-07-07 15:22
Yes, we do. I know that @kostas is eager to do this, and I told him we could wait until his return.

simon
2016-07-07 15:22
so for doing total order, we don't need to maintain a log

jyellick
2016-07-07 15:23
I really need to understand this network interface

simon
2016-07-07 15:23
but if all endorsers/peers miss a broadcast block, it is lost forever

jyellick
2016-07-07 15:23
Does everyone get ordering messages from all consenters? Or do they subscribe to a particular one

simon
2016-07-07 15:23
unclear yet

simon
2016-07-07 15:23
i would say, you connect to the consenter cloud

simon
2016-07-07 15:24
maybe to one particular one

simon
2016-07-07 15:24
does it matter?

jyellick
2016-07-07 15:24
I would say yes.

jyellick
2016-07-07 15:24
Because a byzantine attack can certainly cause a particular consenter to end up with a gap in the log it broadcasts

jyellick
2016-07-07 15:24
(unless we store and somehow state transfer this log)

simon
2016-07-07 15:25
right

jyellick
2016-07-07 15:25
Or, ignore the byzantine attack, it could happen under more benign conditions as we have seen.

simon
2016-07-07 15:25
so when a peer connects to the consensus cloud, it needs to be able to retrieve a partial log

jyellick
2016-07-07 15:27
I believe traditionally, with log replication, you have periodic snapshots, which allows you to garbage collect the log, so that someone who connects either has recent enough state that their log overlaps with the available log, or, they retrieve that snapshot, and then can retrieve the log.

simon
2016-07-07 15:27
or it has a way to retrieve blocks via some other mechanism

simon
2016-07-07 15:28
honestly, i think somebody else needs to design and implement this

jyellick
2016-07-07 15:28
'this'?

simon
2016-07-07 15:29
logs, state transfer

jyellick
2016-07-07 15:31
It all seems like so much duplication of the blockchain to me

jyellick
2016-07-07 15:34
The old executor service worked off the semi-synchronous deal of: send executions, periodically get back checkpoint values, and send skips including the checkpoint value to skip to. That would completely eliminate the need for us to log or state transfer, push it all back into the peer, and seems like it would work. I'm not sure I love the semi-synchronous nature of it, but I'm struggling to think of anything better.

simon
2016-07-07 15:45
yes it does

simon
2016-07-07 15:46
something is wrong with my code

simon
2016-07-07 15:46
it doesn't execute right

simon
2016-07-07 16:02
@jyellick: can you give me some feedback on what i am doing wrong with the executor interface?

simon
2016-07-07 16:02
i don't see the execute getting through

jyellick
2016-07-07 16:03
Sure, let me pull your branch again

jyellick
2016-07-07 16:06
Can you point me where specifically you're not seeing execution?

simon
2016-07-07 16:07
stuck in chan send in the executor

simon
2016-07-07 16:07
so manager is not running?

jyellick
2016-07-07 16:08
You do need to explicitly start the executor, it does not start at construction

jyellick
2016-07-07 16:08
(Starting the executor starts the backing event manager)

simon
2016-07-07 16:09
oh.

simon
2016-07-07 16:09
who starts it?

jyellick
2016-07-07 16:09
Helper starts it immediately after constructing it

simon
2016-07-07 16:09
hmm

simon
2016-07-07 16:09
odd

simon
2016-07-07 16:09
why do you have to start it explicitly?

jyellick
2016-07-07 16:10
Having it automatically start creates garbage that must be cleaned up. Especially for testing, we want the execution to take place on the testing thread, so we don't have to deal with synchronizing

jyellick
2016-07-07 16:11
Outside of the unit tests, the executor is effectively a singleton, so the additional work after constructing it didn't seem like a big deal

simon
2016-07-07 16:13
aha

simon
2016-07-07 16:14
okay

simon
2016-07-07 16:14
i'm checking out for today

simon
2016-07-07 16:14
been at it already for 9h

simon
2016-07-07 16:14
but it seems to work

simon
2016-07-07 16:15
you can create a basic consensus network now

jyellick
2016-07-07 16:20
Great! Enjoy your evening

simon
2016-07-07 16:45
@jyellick: i just pushed a deploy script

vukolic
2016-07-07 16:53
@simon @jyellick ok there are a few things here

vukolic
2016-07-07 16:55
1) *what is the state of consensus (PBFT)?* The state is only the internal PBFT state - for example: the set of consenters, view number, sequence numbers, P, Q sets and similar. Things related to the ledger (e.g., raw ledger) are not required.

vukolic
2016-07-07 16:56
That said, our implementation of consensus may decide to offer more than the spec - for example, consensus service could (perhaps only best-effort) cache the last K blocks of the raw ledger.

vukolic
2016-07-07 16:56
There might be an additional call (not present in the current API) to get those blocks - but I would say this is not "Phase 1"

simon
2016-07-07 16:57
my concern is that if we do not persist the raw blocks we `deliver()`, these blocks might be lost

vukolic
2016-07-07 16:57
This is not our issue - strictly speaking

simon
2016-07-07 16:57
yea

vukolic
2016-07-07 16:57
this is the issue of peers

vukolic
2016-07-07 16:57
some cache needs to exist though in order not to lose all messages

vukolic
2016-07-07 16:57
in case of catastrophes

vukolic
2016-07-07 16:57
that is true

simon
2016-07-07 16:57
well if we consider the consensus network as "miners", we need to retain the chain

vukolic
2016-07-07 16:59
ok, so this is one thing to consider. However one thing is important - this ledger cache if implemented - must not block PBFT "state transfer" among consenters if you see what I mean

simon
2016-07-07 16:59
yes

simon
2016-07-07 16:59
pbft does not need it for its own operation

vukolic
2016-07-07 16:59
we need to entirely decouple the functionality related to PBFT operation and optional raw ledger cache

vukolic
2016-07-07 17:00
We should look at how Kafka is doing this - this exists

vukolic
2016-07-07 17:00
so other points

vukolic
2016-07-07 17:01
2) *To how many consenters a peer connects*. This is another important one. The first step is to decouple PBFT from the peer, and in that decoupling we will at first have the peer trusting the consenter it connects to.

vukolic
2016-07-07 17:01
Of course this is not sufficient

vukolic
2016-07-07 17:01
in general

simon
2016-07-07 17:01
the first step is done

simon
2016-07-07 17:01
pbft runs in a separate process, maintains a separate network

simon
2016-07-07 17:02
and there is a client "library" that provides `Broadcast()` and `Deliver()`

vukolic
2016-07-07 17:02
great - so then we need to *ADD* the functionality that was not there in v0.5 by which a peer *optionally* connects to different consenters if the current one does not work

simon
2016-07-07 17:02
yes, we can do that

simon
2016-07-07 17:03
i'd say the peer needs to do that, using the library

vukolic
2016-07-07 17:03
by the way this is optional as an organization maintaining several peers *and* a consenter may always trust its consenter - so this will improve those peers' performance and simplify their execution path

vukolic
2016-07-07 17:03
now in the general case

vukolic
2016-07-07 17:03
we need other things that are not there yet - such as

vukolic
2016-07-07 17:04
a) as we do not have signatures in Commit messages - a peer must always connect to at least *f+1* consenters and get their confirmations before the library outputs deliver()

vukolic
2016-07-07 17:05
b) if we introduce signatures on commit - we could have a peer tentatively "trusting" a single consenter and waiting for the latter to forward him the commit certificate

vukolic
2016-07-07 17:05
the latter can be silent in which case the peer would connect to another consenter

vukolic
2016-07-07 17:06
now obviously - we have a choice

simon
2016-07-07 17:06
if we sign on commit, then we can include the certificate directly

vukolic
2016-07-07 17:06
a) introduce signatures - or go for f+1 stuff

vukolic
2016-07-07 17:06
you mean if you sign commit msgs?

simon
2016-07-07 17:06
yes

vukolic
2016-07-07 17:07
yes, then once you assemble commit certificate you could forward to client

vukolic
2016-07-07 17:07
this is actually the "only" way to properly do it if we want to avoid connecting to f+1 or more consenters

simon
2016-07-07 17:07
yea

vukolic
2016-07-07 17:07
what is your take on the tradeoff?

vukolic
2016-07-07 17:07
f+1 vs signatures?

simon
2016-07-07 17:07
i'm for signing

simon
2016-07-07 17:08
better scale

simon
2016-07-07 17:08
allows distribution tree

vukolic
2016-07-07 17:09
it seemingly does - we need to see where to sign though - commit msg is an obvious one but not the only one

jyellick
2016-07-07 17:09
I'm still in favor of signing checkpoint messages

jyellick
2016-07-07 17:09
With those corresponding to blocks

simon
2016-07-07 17:10
so that would effectively be a 4th phase

vukolic
2016-07-07 17:11
so we should make a plan on how we proceed with these additions to the current code, which stem from the fact that the consensus client (peer) is not tied to a consenter (esp. trust-wise)

jyellick
2016-07-07 17:11
Effectively. But by separating it from the standard commit path, you could effectively tweak the rate of signatures by modifying the checkpoint interval.

simon
2016-07-07 17:12
but if we can only deliver() what has a certificate, that doesn't help us

vukolic
2016-07-07 17:12
signing chkpoint is possible but then client/peer connecting to a single consenter would output a burst of deliver() events only every checkpoint

simon
2016-07-07 17:12
we still need to wait for a certificate

vukolic
2016-07-07 17:12
but

vukolic
2016-07-07 17:12
checkpoint is sufficient

vukolic
2016-07-07 17:12
yet not necessary

simon
2016-07-07 17:12
which means that we can just increase our batch size

vukolic
2016-07-07 17:12
f+1 signatures upon commit are sufficient

vukolic
2016-07-07 17:12
so it is like 2f+1 commit signatures

vukolic
2016-07-07 17:12
or f+1 4th phase signatures

jyellick
2016-07-07 17:12
I don't see why we could not deliver periodically, with a separate commit phase

vukolic
2016-07-07 17:13
latency

jyellick
2016-07-07 17:13
Set k=1 and you're to the behavior of signatures on commit

vukolic
2016-07-07 17:13
we would be blowing it up for no strong reason + issues if there is not enough traffic

jyellick
2016-07-07 17:13
But it allows people to trade some latency for some throughput

vukolic
2016-07-07 17:14
so, again, for this particular feature checkpoint is an overkill

vukolic
2016-07-07 17:14
f+1 "4th phase" signatures suffice

vukolic
2016-07-07 17:14
or 2f+1 commit signatures (transferable commit certificate)

vukolic
2016-07-07 17:15
now, in future, as we go for large number of consenters, this may need to be revisited

vukolic
2016-07-07 17:15
client would need to verify for n=100 at least 34 signatures

vukolic
2016-07-07 17:15
but also otherwise connect to 34 consenters

vukolic
2016-07-07 17:16
it will certainly be a challenge to properly scale this - but lets have something more basic for Phase 1 (end of September)

vukolic
2016-07-07 17:16
and worry about scalability in Phases 2 and 3

vukolic
2016-07-07 17:28
this emphasizes the need to have the trust of a peer (consensus client) into consenters configurable

vukolic
2016-07-07 17:28
at one extreme we will have peer trusting "his" consenter (very efficient - not very robust)

vukolic
2016-07-07 17:29
at the other extreme we have the above (f+1 connections, or a connection with f+1 or 2f+1 signature verifications) - (very robust - not very efficient)

vukolic
2016-07-07 17:30
one can easily imagine peer/consenter trust policies in between - but practically they may be less appealing

simon
2016-07-07 17:32
maybe the cryptographers have a way to compact all these signatures

simon
2016-07-07 17:33
ok, i gotta go outside

simon
2016-07-07 17:33
staring too much into the screen

vukolic
2016-07-07 17:33
yes - lets discuss this later on further

vukolic
2016-07-07 17:34
we'll get it right :slightly_smiling_face:

vukolic
2016-07-07 17:35
just to conclude for this iteration - I do not particularly like signing commit msgs - but we may want to sign the 4th phase instead

vukolic
2016-07-07 17:35
reasons next time :wink:

jyellick
2016-07-07 18:18
I'm looking at the virtual client / slot stuff again, and I'm wondering about recovering after getting out of sync with the network. Namely, today, we discard our outstanding requests when we get out of sync with the network, as we may have missed their executions

jyellick
2016-07-07 18:20
In an ideal world, we would encode the counter each slot has executed up to somewhere, either in the checkpoint or in the consensus metadata, so that we could intelligently decide whether a particular request has been executed or not, rather than simply discarding them

jyellick
2016-07-07 18:20
The problem I am coming up with, is that as the number of allowed outstanding requests goes up, and the number of replicas goes up, this could become quite large.

jyellick
2016-07-07 18:21
Since each counter is 64 bit, were we to allow 1000 outstanding requests per replica (I think this is the upper end of what is reasonable, but possible), then we would have 8KB of counter data per replica

vukolic
2016-07-07 18:21
I lost the context a bit - is this still relevant for v2? Asking since you talk about execution...

jyellick
2016-07-07 18:22
Yes, this is. This comes back to the idea that, per the PBFT paper, a client should submit requests one at a time, and wait for the execution to complete.

jyellick
2016-07-07 18:23
Because the consenters are acting as clients, we can't reasonably only have one outstanding request per consenter.

vukolic
2016-07-07 18:24
how come they act as clients? they will normally not broadcast -except perhaps for some maintenance/reconfiguration operations

jyellick
2016-07-07 18:24
I guess we could re-evaluate for v2, but, assuming you want an endorser to be able to connect to a single consenter and pass in transactions

jyellick
2016-07-07 18:25
Then that single consenter needs to act as the PBFT client, broadcasting the PBFT request to the consenting network.

vukolic
2016-07-07 18:25
in this case the replica/consenter is merely a proxy not a client

vukolic
2016-07-07 18:25
so the broadcaster stays the client (peer)

jyellick
2016-07-07 18:26
Sure, the consenter is not originating the request, but it is the one who is assuming responsibility for it, waiting for its execution to complete before submitting a new request

vukolic
2016-07-07 18:26
not really

jyellick
2016-07-07 18:26
Or rather, before proxying a new request if you prefer

vukolic
2016-07-07 18:26
it should not wait for that IMO

jyellick
2016-07-07 18:26
How then, do you prevent censorship?

vukolic
2016-07-07 18:27
remind me of the problem?

jyellick
2016-07-07 18:27
A consenter broadcasts this proxied request to the network.

jyellick
2016-07-07 18:28
Each receiving replica must first decide whether this request has been executed or not. Because it's possible the network has already done the 3 phase protocol for this request, before the broadcast reaches every replica.

vukolic
2016-07-07 18:28
normally there are client timestamps for this

vukolic
2016-07-07 18:29
also in PBFT

jyellick
2016-07-07 18:29
Yes! There are... but it exposes us to potential censorship without the waiting approach

vukolic
2016-07-07 18:29
ok listening :slightly_smiling_face:

jyellick
2016-07-07 18:29
So, it is easy enough, for the receiving replica to check the timestamp of the request against the timestamp of the most recently executed request, and, if the timestamp is older, to discard, and this is our current behavior.

vukolic
2016-07-07 18:30
so far so good

jyellick
2016-07-07 18:30
Now, imagine that a replica broadcasts two requests to the network, request A with timestamp of 1, and request B with timestamp of 2

jyellick
2016-07-07 18:31
The primary byzantinely wants to censor request A, so, it first orders request B

jyellick
2016-07-07 18:32
Now, the network executes B, so, any replica which receives request A will believe it is stale and already executed, and discard it.

vukolic
2016-07-07 18:32
ok, right so there are two things here

vukolic
2016-07-07 18:32
first if consenter = proxy not client

vukolic
2016-07-07 18:33
then we can have client still have 1 outstanding request as different proxied requests will have different timestamp (clientID,clientTimestamp)

vukolic
2016-07-07 18:33
so that would still be ok - you do not need 1 request proxied at a time

vukolic
2016-07-07 18:33
now

vukolic
2016-07-07 18:33
to allow the client/peer to broadcast requests in parallel we do face the issue you mention

vukolic
2016-07-07 18:34
and in that case we need FIFO for client requests

jyellick
2016-07-07 18:34
Ah, so, I had a different implementation idea

vukolic
2016-07-07 18:34
ok, listening :slightly_smiling_face:

jyellick
2016-07-07 18:34
That does not require FIFO

vukolic
2016-07-07 18:34
BTW FIFO may be of independent interest

vukolic
2016-07-07 18:34
as aguarantee to clients

vukolic
2016-07-07 18:35
but go on

jyellick
2016-07-07 18:35
So, each replica maintains some number of "virtual client ids", I usually refer to them as 'slots', as there are a fixed number of them and they are occupied or vacant.

vukolic
2016-07-07 18:35
(because one request at a time somehow gives FIFO - so we may want to maintain that if we go for multiple "parallel broadcasts")

jyellick
2016-07-07 18:35
When a request is proxied, the replica looks for an empty slot, and assigns the request that virtual client ID before broadcasting to the network

jyellick
2016-07-07 18:36
Each slot has a slot local counter (akin to a timestamp), and the promise is that a non-byzantine replica will never broadcast two requests with the same slot number until the previous request has prepared

vukolic
2016-07-07 18:37
ok so slot is like a window? shared by many clients or 1 window per client?

jyellick
2016-07-07 18:38
Sort of like a window I suppose, potentially shared by many clients.

jyellick
2016-07-07 18:38
The number of slots is the number of total outstanding requests allowed for each replica

jyellick
2016-07-07 18:39
And the replica might choose to allocate all slots to a single client, or to distribute them among many clients

vukolic
2016-07-07 18:39
hm, this may already be limiting if we have a lot of clients but they are fine with broadcasting 1 request at a time

vukolic
2016-07-07 18:39
but go on

jyellick
2016-07-07 18:42
Slots hold requests which are not prepared. Once a request is prepared, the last-prepared counter associated with that slot is set to the value of the counter for the request that was prepared. Any request which is received whose counter is less than this last-prepared counter will be assumed to be stale, and discarded.

jyellick
2016-07-07 18:42
This prevents any request from being multiply executed.

jyellick
2016-07-07 18:43
Because no new requests are sent for a slot until that request is prepared, it is easy to run a timer against the slot to detect censorship.

jyellick
2016-07-07 18:43
And the degree of parallelism is controlled by the number of slots.

vukolic
2016-07-07 18:43
i was wondering could we simply view-change the leader who orders the request with timestamp 2 instead of 1 for a given client

vukolic
2016-07-07 18:44
and basically track per client the last sequence number

jyellick
2016-07-07 18:44
Ah, if we have FIFO, we could

vukolic
2016-07-07 18:44
so this would be somehow implementing fifo, no?

vukolic
2016-07-07 18:45
primary would put an advance request n+2 in some queue and not actually process it until n+1 is processed

vukolic
2016-07-07 18:45
to prevent DoS the n+k above could be limited by k

vukolic
2016-07-07 18:45
which is the number of outstanding reqs a client may have

vukolic
2016-07-07 18:45
hard coded - to begin with

vukolic
2016-07-07 18:45
now k=1
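
The primary-side scheme sketched in the last few messages, releasing a client's requests to ordering strictly in sequence while buffering at most k requests ahead, might look like this (hypothetical names; k hard-coded as discussed):

```go
package main

import "fmt"

const k = 3 // max outstanding requests per client (hard-coded to begin with)

// clientFIFO releases one client's requests to ordering strictly in
// sequence-number order, buffering at most k requests ahead.
type clientFIFO struct {
	next    uint64            // next sequence number to order
	pending map[uint64][]byte // out-of-order requests, bounded by the window
}

func newClientFIFO() *clientFIFO {
	return &clientFIFO{next: 1, pending: map[uint64][]byte{}}
}

// submit buffers req and returns every request that is now ready to be
// ordered, in sequence. Requests more than k ahead are dropped (DoS
// protection); duplicates and stale requests are ignored.
func (c *clientFIFO) submit(seq uint64, req []byte) [][]byte {
	if seq < c.next || seq >= c.next+k {
		return nil
	}
	c.pending[seq] = req
	var ready [][]byte
	for {
		r, ok := c.pending[c.next]
		if !ok {
			break
		}
		delete(c.pending, c.next)
		c.next++
		ready = append(ready, r)
	}
	return ready
}

func main() {
	c := newClientFIFO()
	fmt.Println(len(c.submit(2, []byte("b")))) // 0: still waiting on seq 1
	fmt.Println(len(c.submit(1, []byte("a")))) // 2: releases seq 1 then seq 2
}
```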

jyellick
2016-07-07 18:46
Hmmm, let me think

vukolic
2016-07-07 18:46
sure - I need to transfer to France v. Germany

jyellick
2016-07-07 18:46
The tricky part about this implementation

vukolic
2016-07-07 18:46
but will be back :slightly_smiling_face:

jyellick
2016-07-07 18:46
Is that the primary does not necessarily send pre-prepares in order

jyellick
2016-07-07 18:47
So to conclude whether the primary is sequentially ordering requests gets trickier

vukolic
2016-07-07 18:47
yes, so this would make it look at the client timestamp and actually do them in order

vukolic
2016-07-07 18:47
I think it would be vital to enforce FIFO if we allow parallel requests

vukolic
2016-07-07 18:47
not sure submitting peer would like it differently...

jyellick
2016-07-07 18:48
But imagine that client submits reqs A,B,C with increasing timestamps.

jyellick
2016-07-07 18:49
The primary orders B,C into seqNo=3, but has not sent preprepare for seqNo=2 yet

jyellick
2016-07-07 18:49
The network will prepare and commit this, because there's nothing obviously wrong with it.

vukolic
2016-07-07 18:49
ok, yes so we would need to eliminate this watermark thingy

vukolic
2016-07-07 18:50
and basically rely on batching for throughput optimisation

vukolic
2016-07-07 18:50
and do batches 1 by 1

jyellick
2016-07-07 18:50
Yes, we could do this, but it would be a pretty significant change

vukolic
2016-07-07 18:50
not necessarily bad... and it can be easy to try out (by hardcoding watermarks)

vukolic
2016-07-07 18:50
to H=L+1

vukolic
2016-07-07 18:51
or H=L not sure how that goes

vukolic
2016-07-07 18:52
side comment: I remember talking to Alysson Bessani (lead of BFT Smart) - he told me they eliminated watermarks early on...

jyellick
2016-07-07 18:52
Interesting. I think we could certainly completely eliminate watermarks if we assume an underlying FIFO stream

jyellick
2016-07-07 18:53
But, we do have working watermarks today. I really think we could switch to UDP transports, and things would continue to work.

vukolic
2016-07-07 18:53
that might be more straightforward to do - put H=L and make primary order things from clients one by one

vukolic
2016-07-07 18:53
well we may have both

vukolic
2016-07-07 18:53
just a note to system admins

vukolic
2016-07-07 18:54
do not put H!=L (watermarks) and K>1 (parallel req FIFO) at the same time

jyellick
2016-07-07 18:55
I don't know, I think we should commit one way or the other

jyellick
2016-07-07 18:55
The additional code complexity of supporting both windowed and non-windowed modes seems like a waste

vukolic
2016-07-07 18:55
ok, I think watermarks may have limited use - it was PBFT way of doing batching

vukolic
2016-07-07 18:55
so batching handles that

vukolic
2016-07-07 18:56
and then parallel FIFO is not too complex

vukolic
2016-07-07 18:56
with primary needing to look at client ts and do +1

vukolic
2016-07-07 18:56
that queue might be problematic

vukolic
2016-07-07 18:56
but I would limit it to K

vukolic
2016-07-07 18:56
hard coded

vukolic
2016-07-07 18:56
for all clients

vukolic
2016-07-07 18:56
later on one could play with more flexible Ks

vukolic
2016-07-07 18:57
need to take off

vukolic
2016-07-07 18:57
thanks for the discussion

vukolic
2016-07-07 18:57
will catch up later

jyellick
2016-07-07 18:57
Alright, thanks for the chat, will think on this

jyellick
2016-07-07 19:13
@vukolic Independent of all of the discussion above. In the Castro paper, each client submits a request one at a time, and the replica tracks the timestamp per client to determine whether it should send a view change or not. It is somewhat handwaved away that this information could be stored on disk, but what is not clear to me is how this information is updated during fall-behind / catch-up scenarios. If a replica crashes after receiving a request and must do state transfer, does it discard its outstanding requests? If so, then what prevents a cascade of failures from censoring requests? Imagine 4 nodes, all receive a request, then vp0 crashes, recovers, discards outstanding, and then so do vp1,2,3 in sequence. At this point, all outstanding requests have been discarded, and the request may or may not have executed. To me, the correct way to handle this is to track the last executed request for each client, and somehow have this recovered via state transfer. My problem here is, as they mention, there could be lots of clients, so recovering this data via state transfer might be very expensive indeed.
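
The per-client duplicate suppression from the Castro paper that this message refers to amounts to a small table (names here are hypothetical); the concern is that this table has one entry per client, so shipping it during state transfer can be expensive:

```go
package main

import "fmt"

// lastExec maps each client ID to the timestamp of its last executed
// request. A request with a timestamp at or below this value is a
// duplicate (or stale) and must not be executed again.
type lastExec map[string]uint64

// shouldExecute reports whether a request is new for this client, and
// records its timestamp if so.
func (l lastExec) shouldExecute(client string, ts uint64) bool {
	if ts <= l[client] {
		return false
	}
	l[client] = ts
	return true
}

func main() {
	l := lastExec{}
	fmt.Println(l.shouldExecute("clientA", 5)) // true: first sighting
	fmt.Println(l.shouldExecute("clientA", 5)) // false: duplicate suppressed
}
```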

simon
2016-07-07 21:25
the client is responsible for resubmitting the request?

simon
2016-07-07 21:26
make `broadcast()` an RPC (it is already) and have it return only when the request has been prepared

jyellick
2016-07-07 21:33
We could certainly require the client to resubmit it. But I think we need to clearly define that as part of the v2 architecture then

vukolic
2016-07-08 07:12
yes, in BFT protocols it is usually the client, in the end, that takes care of having its request (periodically) resubmitted

vukolic
2016-07-08 08:15
*re caching raw ledger at consensus service*

vukolic
2016-07-08 08:17
we may want to follow in the footsteps of Kafka - which has a configurable retention time for partitions http://kafka.apache.org/documentation.html#intro_topics

vukolic
2016-07-08 08:17
(it does not store the whole log for eternity - nor, I think, should we)

vukolic
2016-07-08 08:18
now to facilitate the usage of this cache - Kafka allows the consumer (peer in our case) to *seek* on a given offset (batch) in the partition (ledger)


vukolic
2016-07-08 08:20
the question is how we want this - I would really like to see something slightly different from Kafka for state transfer, in which state transfer would be peer-to-peer rather than peer-to-consenter oriented

vukolic
2016-07-08 08:20
Kafka does not have interconsumer state transfer and I strongly believe we should. IMO, we need to avoid, or discourage, peers from "torturing" consensus service with state transfer and make them rely predominantly on peer-to-peer for state transfer.

vukolic
2016-07-08 08:21
BTW, another important design point as we have consenters split from peers is (Push vs Pull) http://kafka.apache.org/documentation.html#design_pull

vukolic
2016-07-08 08:22
when peer is colocated with a consenter and trusts it - push (our v0.5) is obvious choice

vukolic
2016-07-08 08:22
not so clear with the separation...

vukolic
2016-07-08 08:22
comments welcome

vukolic
2016-07-08 08:23
*re signatures vs multi connections*

vukolic
2016-07-08 08:24
I would basically have two parts of the consensus client library: 1) one in which the peer trusts "his" consenter and 2) the other in which it does not

vukolic
2016-07-08 08:24
clearly 1) is not BFT but has its place in practice (a single organization running peers trusting "its" consenter)

vukolic
2016-07-08 08:25
now for 2) we discussed signatures vs. multi-connections

vukolic
2016-07-08 08:25
with multi-connections we could reuse the 1st part of the library and get only a hash of the committed batches as confirmation on the other f connections

vukolic
2016-07-08 08:25
this would be alternative to signatures - as part of brainstorming
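
The multi-connection alternative sketched above, checking that the hash delivered on the trusted connection matches the hashes reported on f additional connections, could look like this (a brainstorming sketch only; `confirmed` is a hypothetical name):

```go
package main

import "fmt"

// confirmed reports whether the batch hash received over the primary
// connection matches the hashes reported on f additional connections.
// Together with the primary connection that gives f+1 consistent answers,
// so at least one comes from an honest consenter, assuming at most f
// byzantine consenters.
func confirmed(batchHash string, otherHashes []string, f int) bool {
	matches := 0
	for _, h := range otherHashes {
		if h == batchHash {
			matches++
		}
	}
	return matches >= f
}

func main() {
	fmt.Println(confirmed("abc", []string{"abc", "abc"}, 2)) // true: f others agree
	fmt.Println(confirmed("abc", []string{"abc", "xyz"}, 2)) // false: only one agrees
}
```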

vukolic
2016-07-08 08:26
comments welcome

vukolic
2016-07-08 08:26
*- over and out -*

vukolic
2016-07-08 08:34
@simon @jyellick @kostas @tuand ^^

cca
2016-07-11 07:31
Your 1) and 2) are basically two different interfaces to the consensus service. They should be hidden behind the same interface, but allow one configuration option for the client to express which "submission semantics" it wants. My ideas for names: submission=fast/slow or weak/strong or optimistic/guaranteed or invocation= ""

simon
2016-07-11 08:25
given that we're talking about exchangeable consensus protocols, i think the fabric peer consensus protocol API should be the same

simon
2016-07-11 08:26
i.e. ideally not to require a different "stub" that connects to consensus in a proprietary way

tuand
2016-07-11 12:49
hi

simon
2016-07-11 12:51
hi tuand

tuand
2016-07-11 12:52
ok, so a week's worth of stuff to catch up on :slightly_smiling_face:

simon
2016-07-11 12:54
hehe

simon
2016-07-11 12:54
not much happened outside of consensus

simon
2016-07-11 13:07
so... i need to track the lastExec in the new consensus peer

simon
2016-07-11 14:16
i need some better reconnect logic

simon
2016-07-11 14:16
grpc seems to use exponential retries

simon
2016-07-11 14:22
1200 tx/sec

simon
2016-07-11 14:23
with consensus separated

simon
2016-07-11 14:23
and batchsize=500

simon
2016-07-11 14:29
okay

simon
2016-07-11 14:29
fabric peer now can talk to consensus cloud

simon
2016-07-11 14:29
i'd appreciate if you could test it

simon
2016-07-11 14:30
then maybe do some refactoring

simon
2016-07-11 14:30
and then integrate it to master

simon
2016-07-11 14:33
now it slowed down to 880

simon
2016-07-11 14:33
i wonder why

simon
2016-07-11 14:40
hmm 30% docker

simon
2016-07-11 14:40
is it still logging like crazy?

jyellick
2016-07-11 14:43
@simon: I noticed in my perf testing that the throughput slowly dropped over time

simon
2016-07-11 14:43
it was a local interaction with a VM

simon
2016-07-11 14:43
i think

jyellick
2016-07-11 14:43
This was with your shortcircuited chaincode too

simon
2016-07-11 14:43
oh, now with externalization we are no longer closed loop

simon
2016-07-11 14:43
ah, then that must be ledger

simon
2016-07-11 14:44
but apart from the closed loop, it looks good

simon
2016-07-11 14:44
none of the fast path stuff was tested

jyellick
2016-07-11 15:47
@simon Reviewing your branch, the thing that worries me is: ```
func (c *Server) SyncToTarget(blockNumber uint64, blockHash []byte, peerIDs []*pb.PeerID) (error, bool) {
	panic("not implemented")
}

//
func (c *Server) GetBlockchainInfoBlob() []byte {
	// XXX assemble state for consensus service
	// XXX this probably should include last block hash, etc.
	//panic("not implemented")
	return []byte("some internal state")
}
``` Obviously this can't be handled until we actually figure out the inputs/outputs of consensus, but until these get implemented, almost all failure test scenarios will not succeed.

simon
2016-07-11 15:47
yes

simon
2016-07-11 15:47
but we don't have tests, so that's no problem :slightly_smiling_face:

jyellick
2016-07-11 15:48
Haha, I suppose that is one fix

simon
2016-07-11 15:49
we need to figure out how we want to store consensus config and apply changes

jyellick
2016-07-11 15:52
Without the split, the very natural place is the blockchain, with the split, I'm less convinced. I also wonder about the growing peers. I know some discussion had been made about 'longest chain wins' with respect to determining which whitelist to trust, but I don't know if that ever went anywhere.

simon
2016-07-11 15:56
yea, policy is a topic for another time

simon
2016-07-11 15:56
right now the consensus peers are part of the persist config data

simon
2016-07-11 16:31
kostas: any questions yet?

simon
2016-07-11 16:31
otherwise i'll check out

kostas
2016-07-11 16:31
Sorry, didn't realize you were waiting on me and was still working on my pre-prepare branch. Will look at it tonight and post questions. We can resume tomorrow.

simon
2016-07-11 16:34
okay

simon
2016-07-11 16:34
is there something for me to work on tomorrow?

simon
2016-07-11 16:35
i mean, what are our concrete goals

simon
2016-07-11 16:35
maybe we can make this all a bit more agile

simon
2016-07-11 16:36
because so far we've been doing scrum meetings, but we didn't do the "these are the things we want to get done, how long do you think this will take", etc.

simon
2016-07-11 16:43
@kostas: what is that pre-prepare change exactly?

kostas
2016-07-11 16:45
modifying `pre_prepare` to take a `request_block` rather than a `request` - this ripples into a bunch of things in pbft-core

simon
2016-07-11 16:45
and this is so that when we merge batch and core, we have transparency on the actual request level and don't have to deal with opaque blocks?

kostas
2016-07-11 16:45
Correct.

simon
2016-07-11 16:46
okay

simon
2016-07-11 16:47
we need to figure out a way to reference requests without re-marshalling them just to do a hash over them

kostas
2016-07-11 16:48
I saw your comments on the performance hit this results in, yes.

simon
2016-07-11 16:48
oh, if you rebase onto my code, you also can build and run actual networks much more quickly

kostas
2016-07-11 16:48
Excellent, I shall give it a shot.

simon
2016-07-11 16:48
if everybody is okay, i will rebase my branch onto latest master

simon
2016-07-11 16:48
we shouldn't do both at the same time

simon
2016-07-11 16:48
or commit trouble

simon
2016-07-11 16:49
let me see how quickly i can do that

kostas
2016-07-11 16:50
Right, if you rebase onto the master (pulling in the obc-renaming, and the new discovery service), I can rebase onto your code and get all of those changes in.

kostas
2016-07-11 16:50
Although the discovery service in `fabric` is ultimately irrelevant to your `separate-consensus` work as I understand it.

simon
2016-07-11 16:58
haha discovery service

simon
2016-07-11 16:59
that one didn't last long

simon
2016-07-11 17:12
okay

simon
2016-07-11 17:12
rebased

simon
2016-07-11 17:13
i think at some point we may want to squash some of the commits together

simon
2016-07-11 17:13
but let's do that only just before merging into master

simon
2016-07-11 17:14
@kostas: ready to go

kostas
2016-07-11 17:14
Gotcha, thank you. Will rebase on your branch.

simon
2016-07-11 17:14
cool

simon
2016-07-11 17:15
i guess then i pin down that branch and start working on a new name?

simon
2016-07-11 17:15
making this our new tentative master

kostas
2016-07-11 17:15
Works for me.

kostas
2016-07-11 17:16

simon
2016-07-11 17:16
i think i'll do some performance tests tomorrow and maybe play with building either a bft-smart or a kafka consensus service

simon
2016-07-11 17:16
yes

kostas
2016-07-11 17:16
OK

simon
2016-07-11 17:21
i guess by using bft smart, we could have avoided man months of development work

simon
2016-07-11 17:21
oh well

sheehan
2016-07-11 19:38
do you plan to move consensus into its own repository?

kostas
2016-07-11 19:38
I think it'd be great if we could do that.

kostas
2016-07-12 06:19
After receiving a `batchMessage-Request` message https://github.com/hyperledger/fabric/blob/master/consensus/pbft/batch.go#L325 shouldn't we be doing a check to make sure that the `ReplicaID` included in the message is the same as that of the receiving stream? (Basically, something like this: https://github.com/hyperledger/fabric/blob/master/consensus/pbft/pbft-core.go#L571)

simon
2016-07-12 09:11
i think we should get rid of the replicaid thing

simon
2016-07-12 09:12
and pass it in via recvmsg

simon
2016-07-12 09:12
wait, we do that

simon
2016-07-12 09:12
so hm

simon
2016-07-12 09:29
we need to figure out how to store the consensus config

simon
2016-07-12 09:39
options:
1. yaml text, parsed on start. cons: viper tolerates misspelled keys. i don't like it
2. ini text, parsed on start. there are ini parsers that parse directly into config structs.
3. $config text, parsed on network creation, into config struct. struct stored as grpc bytestream, grpc unmarshal on start.
4. $config text, parsed on network creation. values are stored in separate keys, and individually parsed on startup. seems fiddly.
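
The viper complaint raised later (tolerating misspelled keys) could be addressed in option 2 by a strict parser that rejects unknown keys. A stdlib-only sketch, not the parser actually used, with hypothetical keys `N` and `batchsize`:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Config is a hypothetical consensus config struct.
type Config struct {
	N         int
	BatchSize int
}

// parseConfig parses simple `key = value` lines, failing loudly on
// unknown or malformed keys -- the validation viper lacks.
func parseConfig(text string) (Config, error) {
	var c Config
	for _, line := range strings.Split(text, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blanks and comments
		}
		parts := strings.SplitN(line, "=", 2)
		if len(parts) != 2 {
			return c, fmt.Errorf("malformed line: %q", line)
		}
		key := strings.TrimSpace(parts[0])
		val, err := strconv.Atoi(strings.TrimSpace(parts[1]))
		if err != nil {
			return c, fmt.Errorf("bad value for %s: %v", key, err)
		}
		switch key {
		case "N":
			c.N = val
		case "batchsize":
			c.BatchSize = val
		default:
			return c, fmt.Errorf("unknown key: %s", key) // catches typos
		}
	}
	return c, nil
}

func main() {
	c, err := parseConfig("N = 4\nbatchsize = 500")
	fmt.Println(c, err) // parses cleanly
	_, err = parseConfig("batchsze = 500")
	fmt.Println(err) // typo is rejected instead of silently ignored
}
```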

simon
2016-07-12 11:22
something happened and i seem to only be able to process one request per second, very odd

simon
2016-07-12 11:23
somehow not all requests make it into the batch...

simon
2016-07-12 11:24
OH

simon
2016-07-12 11:24
my requests are all the same -_-

simon
2016-07-12 11:24
and therefore get filtered

simon
2016-07-12 13:02
so i get around 8500 transactions per second

simon
2016-07-12 13:03
which is really really bad

simon
2016-07-12 13:03
and the cpu profile is all over the place

simon
2016-07-12 13:03
marshaling, network IO, tls?

simon
2016-07-12 13:04
maybe the rpc overhead indeed is too high

simon
2016-07-12 13:08
15500 with pbft short circuited

simon
2016-07-12 13:09
so the overhead seems to be 50:50 pbft and grpc

simon
2016-07-12 13:13
super poor number, 15k

simon
2016-07-12 14:14
@jyellick, @kostas, @tuand: ideas about the config thing?

jyellick
2016-07-12 14:14
Are you on the phone?

simon
2016-07-12 14:14
i am

simon
2016-07-12 14:15
i can hear your slack sounds :slightly_smiling_face:

kostas
2016-07-12 15:50
So, going back to this:

kostas
2016-07-12 15:51
I'm not sure what $config text means?

jyellick
2016-07-12 15:55
I'm not sure any of the options are great.
1. I have no real issue with this, beyond yours stated
2. ini is an odd choice to me, seems like a somewhat antiquated and fiddly format of itself
3. I see the benefit to doing this: it is binary, which keeps people from thinking they can screw with it, but, simultaneously, hard to see what's going on
4. Not really sure what this means
I think my vote is 1.

tuand
2016-07-12 15:55
should we have the same config system across all of fabric? I vote for 1, or make a change across the board

simon
2016-07-12 15:55
yaml is way bad

simon
2016-07-12 15:56
not acceptable for a product

simon
2016-07-12 15:56
or at least viper

simon
2016-07-12 15:56
ini is just `key = value` pairs, essentially

jyellick
2016-07-12 15:57
What is so way bad about it?

jyellick
2016-07-12 15:57
(But agree, consistency is valuable here)

simon
2016-07-12 16:02
it doesn't validate keys

kostas
2016-07-12 16:02
(Still not clear on the $config text by the way.)

simon
2016-07-12 16:02
so you can hunt for hours for a typo

simon
2016-07-12 16:03
i don't think we should stick with a bad way just for perceived consistency

jyellick
2016-07-12 16:05
I'm not sure what makes it 'perceived' and not 'actual'

jyellick
2016-07-12 16:06
But I guess the complaint is that viper has no config schema to complain about unnecessary or missing keys?

simon
2016-07-12 16:06
yes

simon
2016-07-12 16:06
together with yaml whitespace formatting, etc.

simon
2016-07-12 16:06
just not good

jyellick
2016-07-12 16:07
Because my recollection is that yaml does support schemas

jyellick
2016-07-12 16:07
Not certain if viper does

jyellick
2016-07-12 16:11
Looks like not. But I'm not sure how ini solves any of these? I've seen multiple ini formats, and there's no native way to detect missing or misspelled keys. Obviously you could code something up to do that, but I'm not sure what prevents this from being done against yaml as well.

jyellick
2016-07-12 16:12
( [what I was thinking of](http://www.kuwata-lab.com/kwalify/) )

simon
2016-07-12 16:17
there are go packages that do ini parsing

simon
2016-07-12 16:17
into a struct

michele
2016-07-12 17:04
has joined #fabric-consensus-dev

simon
2016-07-13 11:47
so with a crude "rpc" implementation in C (no TLS), i get to 172k ops/s on my laptop

simon
2016-07-13 11:48
just to get an upper bound

simon
2016-07-13 11:48
so using grpc with go is 8% of this performance figure

kostas
2016-07-13 11:48
Are yesterday's numbers w/o TLS as well?

simon
2016-07-13 11:49
no

simon
2016-07-13 11:50
these were go grpc with tls

simon
2016-07-13 11:52
i don't know whether it is worth hacking in gnutls to see a performance difference

simon
2016-07-13 12:56
83k ops/sec with gnutls anon

simon
2016-07-13 12:56
@kostas: happy? :slightly_smiling_face:

vukolic
2016-07-13 13:07
ouch

vukolic
2016-07-13 13:07
what is an op? still Bitcoin size?

simon
2016-07-13 13:39
couple of bytes

simon
2016-07-13 13:39
not much

simon
2016-07-13 13:40
heh it dropped to 75k, probably because of thermal limiting

simon
2016-07-13 13:42
vukolic: the size doesn't impact things much

simon
2016-07-13 13:42
localhost bandwidth is not a problem

vukolic
2016-07-13 13:46
so it is 1/latency of the thingy

vukolic
2016-07-13 13:47
but 75k vs 15k is a lot

simon
2016-07-13 13:56
well, one is C without any magic dispatch system

simon
2016-07-13 14:14
with streaming rpc: ~13k ops/sec

simon
2016-07-13 14:20
38k ops/sec with streaming without pbft

simon
2016-07-13 14:26
so go+streaming grpc = 1/2 performance of C

simon
2016-07-13 14:26
which is fine

vukolic
2016-07-13 15:37
ack

scottz
2016-07-13 22:43
why is "F" defined in consensus/obcpbft/config.yaml? I thought it was simply computed = (N-1)/3. Should we change F if I change "N"?

kostas
2016-07-13 22:44
N can actually be > 3f+1

kostas
2016-07-13 22:44
3f+1 is the minimum value it can get
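
The relation behind this exchange (N ≥ 3f+1, so the maximum tolerated f for a given N is ⌊(N-1)/3⌋) in a tiny helper (hypothetical name, for illustration):

```go
package main

import "fmt"

// maxFaults returns the largest f such that n >= 3f+1, i.e. the maximum
// number of byzantine replicas an n-replica PBFT network can tolerate.
func maxFaults(n int) int {
	return (n - 1) / 3
}

func main() {
	fmt.Println(maxFaults(4))  // 1
	fmt.Println(maxFaults(7))  // 2
	fmt.Println(maxFaults(10)) // 3
}
```

As noted below, configuring f below this maximum (N >> 3f) is also legitimate when full resilience is not needed.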


cca
2016-07-14 06:58
@scottz: with N >> 3f one could run more efficiently if the full resilience is not needed.

simon
2016-07-14 09:17
so what was the reason why we don't like signatures on commit messages?

simon
2016-07-14 10:50
radical idea: assume that we don't have arbitrary chaincode, but only a DSL describing an endorsement enforcement policy, i.e. something that deterministically describes a required set of signatures. changes to the ledger (key/value store) are authorized by this policy. do we still need a BFT network for ordering/appending to the ledger?

simon
2016-07-14 12:08
@tuand @jyellick @kostas did any of you get the chance of reviewing my separate-consensus branch?

kostas
2016-07-14 12:59
@simon: I can test it out right now - let's do the Q&A to build that README while at it

kostas
2016-07-14 12:59
So, how do you build and execute?

simon
2016-07-14 13:00
`cd consensus-peer`

simon
2016-07-14 13:00
`go build`

kostas
2016-07-14 13:01
And then we do `./local-deploy.sh /tmp 4` or sth along these lines?

simon
2016-07-14 13:01
then do `./local-deploy.sh foo 4`

kostas
2016-07-14 13:01
Done, what's next

simon
2016-07-14 13:02
`run-1.sh`

kostas
2016-07-14 13:02
What's the result of running local-deploy?

simon
2016-07-14 13:02
possibly drop a consensus.yaml into the current directory

simon
2016-07-14 13:02
check out the `foo` directory

kostas
2016-07-14 13:02
Roger

simon
2016-07-14 13:03
so just run all 4 peers

kostas
2016-07-14 13:03
@kostas uploaded a file: https://hyperledgerproject.slack.com/files/kostas/F1RMW9570/screen_shot_2016-07-14_at_09.02.59.png and commented: Empty. Should I have dropped the YAML file in there before?

simon
2016-07-14 13:04
why empty

simon
2016-07-14 13:04
must have errored

kostas
2016-07-14 13:04
You see all the output in the screenshot.

simon
2016-07-14 13:04
seems your shell doesn't print error results

simon
2016-07-14 13:05
do you have `certtool`?

kostas
2016-07-14 13:05
I do not

simon
2016-07-14 13:05
well that would be it

simon
2016-07-14 13:05
it's part of gnutls

kostas
2016-07-14 13:05
Alright, let me get that then.

kostas
2016-07-14 13:08
Cool, that worked.

kostas
2016-07-14 13:08
Drop `consensus.yaml` into `foo` comes next?

simon
2016-07-14 13:08
for example, yes

simon
2016-07-14 13:09
if you did a `go install`, then you don't have to fiddle around with `PATH`

kostas
2016-07-14 13:14
Alright, so it works even w/o dropping a `consensus.yaml` in there

simon
2016-07-14 13:15
yea, it'll magically use it

simon
2016-07-14 13:15
but just by chance, i think

simon
2016-07-14 13:15
so you have a running network now?

kostas
2016-07-14 13:16
Just so we're on the same page - is `consensus.yaml` a renamed `pbft/config.yaml`?

simon
2016-07-14 13:16
oh, then it is called config.yaml

simon
2016-07-14 13:16
i didn't change the config loading code

kostas
2016-07-14 13:16
Gotcha, let me try something.

simon
2016-07-14 13:22
and?

kostas
2016-07-14 13:26
If I run `./run-1.sh` I'm replica vp3

simon
2016-07-14 13:26
yes

kostas
2016-07-14 13:26
If I run `./run-2.sh` I would expect to be another replica, correct?

simon
2016-07-14 13:26
yes

kostas
2016-07-14 13:26
I'm still `vp3`

simon
2016-07-14 13:27
wat

kostas
2016-07-14 13:27
Same for run-3 and run4

kostas
2016-07-14 13:27
And hmm, I see an error message.

simon
2016-07-14 13:27
which message?

kostas
2016-07-14 13:28
`listen tcp :7100: bind: address already in use`. Am I supposed to be able to launch all of these processes from the same vagrant session?

simon
2016-07-14 13:28
yes

simon
2016-07-14 13:28
can you show the contents of run-2.sh?

kostas
2016-07-14 13:29
```vagrant@hyperledger-devenv:v0.0.10-bab9e41:/opt/gopath/src/github.com/hyperledger/fabric/consensus-peer/foo$ ./run-2.sh
listen tcp :7100: bind: address already in use
2016/07/14 13:23:10 replica 0: 96b6ffc13c4b4626e6c9c12c08518b97499ce3023740871e0e276652ef1434d8
2016/07/14 13:23:10 replica 1: 96b6ffc13c4b4626e6c9c12c08518b97499ce3023740871e0e276652ef1434d8
2016/07/14 13:23:10 replica 2: 96b6ffc13c4b4626e6c9c12c08518b97499ce3023740871e0e276652ef1434d8
2016/07/14 13:23:10 replica 3: 96b6ffc13c4b4626e6c9c12c08518b97499ce3023740871e0e276652ef1434d8
2016/07/14 13:23:10 we are replica vp3 (96b6ff [:6102])```

kostas
2016-07-14 13:29
Let me know if you want more.

simon
2016-07-14 13:29
uhm

simon
2016-07-14 13:29
why are all the certificates the same?

simon
2016-07-14 13:29
that should not happen

kostas
2016-07-14 13:29
This is where you come in and help me debug it.

simon
2016-07-14 13:30
something about no entropy

simon
2016-07-14 13:30
certtool should create different certificates

simon
2016-07-14 13:30
hm

simon
2016-07-14 13:30
can you show the contents of run-2.sh?

kostas
2016-07-14 13:31
Last line is `consensus-peer -addr :6102 -cert cert2.pem -key key.pem -data-dir data2 "$@"`

kostas
2016-07-14 13:31
And all the `cert*.pem` files are indeed identical.

simon
2016-07-14 13:31
okay, the listen error is about the profiling port

simon
2016-07-14 13:31
which is fine

simon
2016-07-14 13:32
that's why it isn't fatal

simon
2016-07-14 13:32
something about certtool in your vagrant is really wrong

kostas
2016-07-14 13:32
Googling.

simon
2016-07-14 13:32
even if you repeat you get the same certificates?

kostas
2016-07-14 13:33
Let me re-run the deploy script.

simon
2016-07-14 13:34
you'll have to remove the directory

kostas
2016-07-14 13:34
Yup yup.

kostas
2016-07-14 13:34
Still the same. This is odd.

simon
2016-07-14 13:35
well, you should be able to run all of this outside vagrant

simon
2016-07-14 13:37
@kostas: try removing the 2>/dev/null on the certtool invocations

simon
2016-07-14 13:37
maybe it will say something interesting

kostas
2016-07-14 13:37
On it.


simon
2016-07-14 13:40
hum

simon
2016-07-14 13:41
your certtool is damaged

kostas
2016-07-14 13:41
Huh.

simon
2016-07-14 13:41
it uses a super short serial

simon
2016-07-14 13:42
probably a 10 year old thing in debian as usual

kostas
2016-07-14 13:42
:simple_smile:

kostas
2016-07-14 13:42
I got it via `sudo apt-get install gnutls-bin` -- that would explain it.

kostas
2016-07-14 13:43
Will install from source and give it another go.

simon
2016-07-14 13:44
nonono

simon
2016-07-14 13:44
let's make it work with vagrant

kostas
2016-07-14 13:45
I'm listening.

simon
2016-07-14 13:51
okay, i pushed a new version

kostas
2016-07-14 13:53
`date: invalid date ‘%s%9N’`?

simon
2016-07-14 13:53
oh are you kidding me

simon
2016-07-14 13:54
what kind of linux is this?

kostas
2016-07-14 13:54
Ubuntu 14.04.04

simon
2016-07-14 13:55
2 years old and probably even older

simon
2016-07-14 13:55
well, what can date produce

kostas
2016-07-14 13:56
If you're asking for its output, it looks like this: `Thu Jul 14 13:55:43 UTC 2016`

simon
2016-07-14 13:56
well new date can use %N to output nanoseconds

simon
2016-07-14 13:58
okay

simon
2016-07-14 13:58
try again

kostas
2016-07-14 13:59
`date +%N` gives me nanoseconds FWIW

kostas
2016-07-14 14:00
Don't know if you're ready for this...

kostas
2016-07-14 14:01
`date: invalid date ‘%s1’`

simon
2016-07-14 14:01
OH WHAT

simon
2016-07-14 14:02
what is this about?

simon
2016-07-14 14:02
so %N gives nanosec, and %s gives sec, but %s%N doesn't work?

simon
2016-07-14 14:02
or %s%9N

simon
2016-07-14 14:02
or %s1

simon
2016-07-14 14:02
what is this lunacy

kostas
2016-07-14 14:03
Let me try something.

kostas
2016-07-14 14:05
Can you give me an example output of `%s%9N` from your machine?

simon
2016-07-14 14:05
it's seconds then nanoseconds

simon
2016-07-14 14:05
```% date +'%s%9N'
1468505127401261388```

kostas
2016-07-14 14:05
```consensus-peer$ date +'%s%9N'
1468505146450186473```

kostas
2016-07-14 14:06
So, that works.

simon
2016-07-14 14:06
so what is going on?

simon
2016-07-14 14:06
OH

simon
2016-07-14 14:06
the +

simon
2016-07-14 14:06
oh my.

kostas
2016-07-14 14:06
Yes.

simon
2016-07-14 14:06
why did this work then?

simon
2016-07-14 14:06
a mystery

simon
2016-07-14 14:07
okay, i'm going to rewind this branch

simon
2016-07-14 14:07
just because there is too much embarrassment

kostas
2016-07-14 14:07
Yup.

simon
2016-07-14 14:08
okay

simon
2016-07-14 14:14
@kostas: does it work now?

kostas
2016-07-14 14:14
Was talking to @jzhang for a sec, checking now.

kostas
2016-07-14 14:19
`error parsing command line: template1.cfg: given number '1468505963808852495' was too big or too small in option 'serial' at position 51 in config file`

simon
2016-07-14 14:22
LOL

simon
2016-07-14 14:22

simon
2016-07-14 14:23
okay, i pushed a rewound version

simon
2016-07-14 14:23
this vagrant is aggravating

simon
2016-07-14 14:23
aggravant

kostas
2016-07-14 14:25
This should work, the certs are different now.

kostas
2016-07-14 14:28
We're good.

simon
2016-07-14 14:31
phew

simon
2016-07-14 14:33
does the network work now?

simon
2016-07-14 14:38
@kostas: did i lose you?

kostas
2016-07-14 14:39
You did not. The network runs fine, we're good.

simon
2016-07-14 14:40
ah cool

simon
2016-07-14 14:40
you can connect to it with the test-client

kostas
2016-07-14 14:40
Can you give me instructions on how to do it?

simon
2016-07-14 14:40
sure

simon
2016-07-14 14:40
`cd test-client; go build`

simon
2016-07-14 14:41
`./test-client -addr :6101 -cert $pathto/cert1.pem -listen`

simon
2016-07-14 14:42
`./test-client -addr :6102 -cert $pathto/cert2.pem -broadcast "hi"`

kostas
2016-07-14 14:44
Neat!

simon
2016-07-14 14:45
you can also use -parallel 10 on the broadcast to run a performance test

kostas
2016-07-14 14:45
Can you tell me what happens with each command? When you pass the `-listen` flag what happens under the covers?

simon
2016-07-14 14:45
or 20

simon
2016-07-14 14:45
it connects and subscribes to the `Deliver` stream

simon
2016-07-14 14:46
now what you can do is run a fabric peer connected to the consensus service

simon
2016-07-14 14:47
`CORE_PEER_VALIDATOR_CONSENTER_ADDRESS=:6101 CORE_PEER_VALIDATOR_CONSENTER_CERT_FILE=$pathto/cert1.pem ./peer node start`

kostas
2016-07-14 14:47
And `-broadcast "foo"` creates a transaction with that payload?

simon
2016-07-14 14:47
yes exactly

kostas
2016-07-14 14:48
When I do the `-parallel` thing, the output looks like this:

kostas
2016-07-14 14:48
`success: 123.95 failure: 0.00`

simon
2016-07-14 14:48
rate per second

kostas
2016-07-14 14:48
What's that number next to `success`?

kostas
2016-07-14 14:48
Gotcha.

kostas
2016-07-14 14:48
And what does the `10` stand for?

kostas
2016-07-14 14:48
I was expecting 10 transactions to be honest.

simon
2016-07-14 14:48
10 parallel goroutines

simon
2016-07-14 14:49
submitting transactions

kostas
2016-07-14 14:49
With each of them submitting how many txs?

simon
2016-07-14 14:49
forever

kostas
2016-07-14 14:49
Oh, so I ctrl+c?

simon
2016-07-14 14:49
yea

kostas
2016-07-14 14:50
Gotcha, 100 tps by the way.

kostas
2016-07-14 14:51
This is great.

kostas
2016-07-14 14:51
(Not the number, but the way the network is spawned.)

kostas
2016-07-14 14:52
I say you go for it and do a PR, we can review the code then.

kostas
2016-07-14 14:56
But I'd definitely add a `README.md` before the PR though.

simon
2016-07-14 15:05
well it definitely degrades resilience compared to v0.5

simon
2016-07-14 15:05
because no state transfer, etc.

simon
2016-07-14 15:44
`CORE_PEER_VALIDATOR_CONSENTER_ADDRESS=local-development-loopback-consensus`

simon
2016-07-15 13:07
looking for @kostas and @jyellick

kostas
2016-07-15 13:08
@simon: hello

kostas
2016-07-15 13:11
(Jason's out until Mon-Tue IIRC)

jyellick
2016-07-15 13:12
(correct, am out on vacation, unless it's something quick)

simon
2016-07-15 13:12
ah okay

simon
2016-07-15 13:12
stay vacating

simon
2016-07-15 13:23
so regarding pbft config persistence:

simon
2016-07-15 13:23
how does the config get persisted?

simon
2016-07-15 13:25
is the config stored as text and parsed?

simon
2016-07-15 13:25
or is it stored as protobuf and unmarshaled?

simon
2016-07-15 13:26
or is it stored as individual elements and converted from []byte to uint64, e.g.

kostas
2016-07-15 13:27
I'm tending towards protobufs, which I believe was option 3 when you asked a couple of days back. IIRC, you're leaning more towards the INI option?

simon
2016-07-15 13:28
well, if we use protobuf, we still need to get the config into the protobuf

kostas
2016-07-15 13:33
What's the workflow here? Edit the protobuf and run, or run using command-line flags, parse those and persist them into a protobuf?

simon
2016-07-15 13:39
exactly that's my question

simon
2016-07-15 13:39
seems that we would still read from some text config


kostas
2016-07-15 13:46
Would it be a bad idea (or as the OP wonders, "moronic") to have a `.proto` file with a `message Config` serve as our new `config.yaml`? Tending towards this even though it'll be less easy to edit than other options. I do not have a hard stand on the matter though.

steven.lebowitz
2016-07-15 18:13
has joined #fabric-consensus-dev

svr
2016-07-17 10:36
has joined #fabric-consensus-dev

simon
2016-07-18 15:40
all of the code parts are so interconnected

simon
2016-07-18 15:40
not happy about this

simon
2016-07-18 16:02
so how do i unify config setting in pbft

simon
2016-07-18 16:05
we have a set of peers, which should drive N

cbf
2016-07-18 16:06
on chain?

kostas
2016-07-18 16:06
What does "on chain" mean?

simon
2016-07-18 16:06
cbf: no, consenters will run entirely separately

kostas
2016-07-18 16:07
Ah.

simon
2016-07-18 16:07
this is a pretty low level question

simon
2016-07-18 16:07
i have a set of peers that are used in `backend`

simon
2016-07-18 16:08
and pbft, which currently is created in `main`

simon
2016-07-18 16:08
so somehow `main` needs to update N?

simon
2016-07-18 16:08
or `backend` instantiates pbft

simon
2016-07-18 16:14
ah cute, that works

sergeybalashevich
2016-07-18 19:50
has joined #fabric-consensus-dev

viewer
2016-07-19 08:00
has joined #fabric-consensus-dev

simon
2016-07-19 12:46
anybody around?

simon
2016-07-19 13:11
jyellick: around?

jyellick
2016-07-19 13:11
Yep

simon
2016-07-19 13:11
i keep trying to figure out a way how to configure the consensus peer

simon
2016-07-19 13:11
like, create the initial configuration

jyellick
2016-07-19 13:12
Right

simon
2016-07-19 13:12
just on a commandline use level

simon
2016-07-19 13:12
and i can't figure it out

simon
2016-07-19 13:12
there are a few small restrictions

simon
2016-07-19 13:12
so i start with a `.ini` file

simon
2016-07-19 13:12
that sets all the pbft things

simon
2016-07-19 13:12
that works fine

simon
2016-07-19 13:13
but in addition to `N`, i need certificates and addresses for the peers

kostas
2016-07-19 13:14
just point it to a dir where these are stored?

simon
2016-07-19 13:14
stored how?

simon
2016-07-19 13:14
every peer is a tuple of (address, certificate)

kostas
2016-07-19 13:14
A generation step comes first.

simon
2016-07-19 13:15
well in a realistic scenario, some other operator would create a certificate, and then tell you the address the peer is on

simon
2016-07-19 13:16
and would hand you the certificate and address

simon
2016-07-19 13:16
possibly somebody would compile a configuration of all peers

simon
2016-07-19 13:16
like, an authoritative config file

jyellick
2016-07-19 13:16
I guess we do not want to require that the address is in the cert, ie the common name?

simon
2016-07-19 13:17
i didn't require this at the moment

simon
2016-07-19 13:17
right now to authenticate, i compare the certificate that is sent

simon
2016-07-19 13:17
i don't know whether comparing a fingerprint is acceptable

simon
2016-07-19 13:18
so the problem basically is that certificates are usually stored in a separate file

simon
2016-07-19 13:18
which complicates things

simon
2016-07-19 13:18
because then i can't just store the ini verbatim

simon
2016-07-19 13:18
but maybe i shouldn't do that at all

simon
2016-07-19 13:20
yea i guess that's too brittle

jyellick
2016-07-19 13:21
How are you enumerating the peers? A comma separated list? If you want to stick with something like ini which doesn't natively support multi-entry fields and such, I would be inclined to go with a nested directory structure.

simon
2016-07-19 13:21
well, exactly, how

jyellick
2016-07-19 13:21
You could simply have a 'peers' directory, and a directory for each peer, which includes a name and a certificate.

simon
2016-07-19 13:21
ignoring the ini format for a moment

simon
2016-07-19 13:21
peers are tuples of address and certificate

simon
2016-07-19 13:22
address, meaning IP/name, port

jyellick
2016-07-19 13:22
Here I am using name/address interchangeably

jyellick
2016-07-19 13:22
Yes

simon
2016-07-19 13:22
so my prototype is something like

simon
2016-07-19 13:23
```
[peer "foo"]
address = name:5111
cert = foocert.pem
```

simon
2016-07-19 13:23
which works well enough

simon
2016-07-19 13:24
just to initialize the local peer

simon
2016-07-19 13:24
i guess that works

simon
2016-07-19 13:24
then the question becomes: how do we store the pbft config in our consensus state

simon
2016-07-19 13:25
because the config should basically form the genesis hash, so that all replicas are configured the same way

jyellick
2016-07-19 13:25
So, do we have a plan for updating configuration at runtime?

simon
2016-07-19 13:26
i have a vague plan

simon
2016-07-19 13:27
basically assuming there is a protocol that allows submission of a new config, which has to be signed by enough replicas, goes through consensus, check signatures, apply configuration to state at some point, then restart application

jyellick
2016-07-19 13:28
How is this reflected in the local configuration? Are the config files essentially inert after bootstrapping?

simon
2016-07-19 13:28
that's why i want to put them into the state

simon
2016-07-19 13:29
and not operate from some random config file

jyellick
2016-07-19 13:29
I understand and like the simplicity of a plain config file to start, but I would be in favor of using whatever this runtime facility is to do the modification

simon
2016-07-19 13:29
you go `consensus-peer -init somefile.ini -data-dir foo`

jyellick
2016-07-19 13:29
Maybe start in a state of 'operate in N=1, f=0, listen only on a local socket, and wait for a config update'

simon
2016-07-19 13:29
and this creates `foo` and populates it with some internal representation of `somefile.ini`

jyellick
2016-07-19 13:30
And then use whatever mechanism it is that injects runtime config changes to do the bootstrapping

simon
2016-07-19 13:30
but then we would have to first develop the mechanism on how to change config during runtime

simon
2016-07-19 13:30
i don't think that's a low effort sequence

jyellick
2016-07-19 13:31
Yes, that's true. I just dislike the idea of multiple configuration paths. I suppose we could always migrate things later.

simon
2016-07-19 13:32
one is "initial setup", the other is "modify setup"

simon
2016-07-19 13:32
there always will be those two

simon
2016-07-19 13:33
anyways, how do we store our state?

simon
2016-07-19 13:33
especially, how do we store our config

simon
2016-07-19 13:34
the problem with grpc is that if we serialize it multiple times, we are not guaranteed the same output

simon
2016-07-19 13:34
i think practically right now there is

simon
2016-07-19 13:34
but it is not good practice to rely on it

simon
2016-07-19 13:35
should we store all pbft config settings in separate persist keys and (un)serialize them by ourselves?

simon
2016-07-19 13:36
then we have a defined format

jyellick
2016-07-19 13:36
Can you explain this?

kostas
2016-07-19 13:36
> the problem with grpc is that if we serialize it multiple times, we are not guaranteed the same output

kostas
2016-07-19 13:36
Not sure I follow.

kostas
2016-07-19 13:36
Ah.

kostas
2016-07-19 13:37
(Same here.)

simon
2016-07-19 13:38
nothing in the grpc spec says that the same structure will always be serialized the same way

simon
2016-07-19 13:38
fields could be reordered

simon
2016-07-19 13:39
well, protobufs

simon
2016-07-19 13:39
not grpc

kostas
2016-07-19 13:39
Got it, so hash will be different, etc.

muralisr
2016-07-19 13:40
@simon… sorry to jump in :slightly_smiling_face: …. but the grpc statement caught my eye

simon
2016-07-19 13:40
yes

kostas
2016-07-19 13:40
So you suggest timeout.request goes to a separate persist key, and the same process is followed for every PBFT key essentially.

simon
2016-07-19 13:40
kostas: that could be one way of doing it

muralisr
2016-07-19 13:41
do protobuf folks specifically say that we cannot depend on structures being serialized the same way ?

muralisr
2016-07-19 13:42
typically we expect “backward compatibility” so we can add new fields to the end…. but if we cannot rely upon ordering even among existing fields, that goes out of the window

simon
2016-07-19 13:42
nothing states that all implementations will always do it the same way

muralisr
2016-07-19 13:43
not saying you are wrong…. but it is surprising

simon
2016-07-19 13:43
unless there is a clear statement that this is a format requirement, i don't think we can rely on it

jyellick
2016-07-19 13:44
This is surprising to me, I thought the numbers associated with the protobuf fields indicated a required ordering (not saying you are wrong, just surprising)

muralisr
2016-07-19 13:45
^^^ ditto


jyellick
2016-07-19 13:50
"when a message is serialized its known fields should be written sequentially by field number"

jyellick
2016-07-19 13:50
It sounds to me like we can rely on field ordering when serializing, however decoding must support arbitrary ordering.

simon
2016-07-19 13:53
should

simon
2016-07-19 13:53
not MUST

simon
2016-07-19 13:53
or will

jyellick
2016-07-19 13:58
Yes, I suppose read with the RFC type meaning of 'should', then it is not guaranteed. This comes back to a possible runtime reconfiguration. If runtime reconfiguration is done by simply broadcasting an encoded protobuf of the config, then it seems like we have no problem?

simon
2016-07-19 13:59
correct

simon
2016-07-19 13:59
if just one person encodes it, all is fine

jyellick
2016-07-19 14:01
I'd be inclined to make the config generation a separate tool, parses your INI, spits out an encoded protobuf config, which is consumable initially only via some 'bootstrap with this config' param, and later via the runtime reconfiguration. Less intuitive to start, but seems more consistent to me.

muralisr
2016-07-19 14:02
I still think the “should” is misleading…. we can take it to the extreme and take it to mean that two back-to-back serializations in the same transport can differ… that would make the deserialization more complex and inefficient.

muralisr
2016-07-19 14:02
I think its worth confirming that the “should” is meant to be a “should"

simon
2016-07-19 14:03
what do you mean by misleading?

muralisr
2016-07-19 14:03
I think they could have wanted to mean would or must

simon
2016-07-19 14:03
jyellick: so use protobuf as internal config serialization?

muralisr
2016-07-19 14:04
did they put the should in bold ? as SHOULD ?

simon
2016-07-19 14:04
muralisr: what does it matter?

simon
2016-07-19 14:04
clearly it doesn't say WILL ALWAYS SRSLY YOU CAN TRUST US

muralisr
2016-07-19 14:04
it matters if it makes implementation convoluted

jyellick
2016-07-19 14:04
simon: That seems like the most direct path to me, and the easiest to extend down the road

muralisr
2016-07-19 14:05
if not, then it doesn't

muralisr
2016-07-19 14:07
btw I meant the should in caps, as they typically do in an RFC…. it would add weight to that intent

muralisr
2016-07-19 14:07
:slightly_smiling_face:

simon
2016-07-19 14:08
i don't think it is good software engineering practice to sloppily accept third party outputs as the foundation for a data structure that should (cryptographically) last for years or decades

muralisr
2016-07-19 14:09
agreed

simon
2016-07-19 14:09
so our stable outputs need to be hand crafted

simon
2016-07-19 14:10
the wire formats can change more easily

simon
2016-07-19 14:10
okay, protobufs then

muralisr
2016-07-19 14:15
however backward compatibility is a big issue with all these wire format datastructures…. not sure if they’d change that easily. There may be a new “version” as in protobuf2 vs 3

simon
2016-07-19 14:15
@kostas @jyellick @tuand in preparation for the scrum, let's make a list of targets we need to reach, and break them down into tasks

muralisr
2016-07-19 14:15
anyway that’s my story

simon
2016-07-19 14:32
@jyellick: can we derive all these viewchange and request and resend timeouts from a single value?

jyellick
2016-07-19 14:33
I think we should

jyellick
2016-07-19 14:33
There is too much dependency between them all

jyellick
2016-07-19 14:33
I'd actually love to set them more heuristically, based on network performance, but that is a step for later

jyellick
2016-07-19 14:35
batch timeout < request timeout < null request timeout

simon
2016-07-19 14:35
okay, so we just define one, and calculate the others

jyellick
2016-07-19 14:35
Right

simon
2016-07-19 14:35
and just say null requests enabled: yes/no

kostas
2016-07-19 14:35
Request timeout is the one to go for I think, based on the issues I witnessed last Friday with the Bluemix service.

jyellick
2016-07-19 14:36
The code today I think defaults to: batch = 1/2 * request, null = 3/2 * request

kostas
2016-07-19 14:36
It does.

simon
2016-07-19 14:37
i think we should define it based on some notion of "network diameter"

simon
2016-07-19 14:37
and use multiples of that

kostas
2016-07-19 14:37
f(N) essentially?

simon
2016-07-19 14:37
no

kostas
2016-07-19 14:38
I'm listening.

simon
2016-07-19 14:38
running in a datacenter, you'd put it at maybe 1ms

simon
2016-07-19 14:39
running on multiple clouds, maybe 2s

simon
2016-07-19 14:39
or 5s

jyellick
2016-07-19 14:39
I still maintain that we should have an option to set these dynamically

jyellick
2016-07-19 14:40
Say, have a floating request timeout equal to 3 times the average request time to execution

jyellick
2016-07-19 14:40
(up to some ceiling or whatnot)

kostas
2016-07-19 14:40
Simon: Sure, but then you need to add logic to identify whether all the peers are in the same subnet, or whatever, right?

simon
2016-07-19 14:40
well, that's in the future

simon
2016-07-19 14:41
kostas: well, when you configure your network, you know that

simon
2016-07-19 14:41
i mean, this is the part where we poke through the "asynchronous network" abstraction

simon
2016-07-19 14:41
so i think the person designing the network should do it

simon
2016-07-19 14:42
it's just a matter of deriving the other timeouts from that one

kostas
2016-07-19 14:42
OK, my initial impression was that you were trying to do this behind-the-scenes, in code.

simon
2016-07-19 14:42
oh no

simon
2016-07-19 14:42
what i want is one timeout, and the other ones are derived from it

simon
2016-07-19 14:42
just to simplify configuration

simon
2016-07-19 14:42
but we can tackle that later

tuand
2016-07-19 14:45
so just to get started on targets like @simon mentioned ... from the various conversations from last week ...

tuand
2016-07-19 14:45
separate out consensus service

tuand
2016-07-19 14:45
"solo" consensus

tuand
2016-07-19 14:46
separate consensus using pbft and others (raft, kafka , etc ... )

tuand
2016-07-19 14:46
i don't know what we want to do with endorsers ??

tuand
2016-07-19 14:46
what other targets ?

simon
2016-07-19 14:47
i would say endorsement stuff is still being worked on, and at least for now we can't take this into consideration

jyellick
2016-07-19 14:48
I think we should include a target for getting the consensus code out of the main fabric repo

simon
2016-07-19 14:48
okay

simon
2016-07-19 14:48
more specifically, we need:

simon
2016-07-19 14:49
- "state" transfer between consensus nodes (can just be the checkpoint "hash" contents)

simon
2016-07-19 14:49
- persisting raw log

simon
2016-07-19 14:50
- reconfiguration

simon
2016-07-19 14:51
- signatures on batches/checkpoints

tuand
2016-07-19 14:51
i'll add as target : dynamic addition of consensters, committers

simon
2016-07-19 14:51
yea, that would be reconfiguration

simon
2016-07-19 14:51
committers are just clients

simon
2016-07-19 14:51
but some of them need to persist the log "for sure"

simon
2016-07-19 14:52
i.e. we cannot advance without these peers having confirmed reception

simon
2016-07-19 14:52
or we need to persist ourselves

kostas
2016-07-19 14:54
Going back to the grpc ordering guarantees discussion for a sec, this is what the `protoc` release notes of v3-b4 (released 17h ago) write:

kostas
2016-07-19 14:54
> The deterministic serialization is, however, NOT canonical across languages; it is also unstable across different builds with schema changes due to unknown fields. Users who need canonical serialization, e.g. persistent storage in a canonical form, fingerprinting, etc, should define their own canonicalization specification and implement the serializer using reflection APIs rather than relying on this API.


kostas
2016-07-19 14:54
So Simon's right.

tuand
2016-07-19 14:57
registration of committers ... is this something we handle as part of consensus service ?

muralisr
2016-07-19 14:58
ah good. thanks @kostas that does clear it

simon
2016-07-19 14:58
what would be registration of committers?

simon
2016-07-19 14:59
@kostas: :slightly_smiling_face:

simon
2016-07-19 14:59
oh, protobufs also generates a json description

simon
2016-07-19 14:59
so there is a json serialization format

simon
2016-07-19 14:59
we can use this for initial config ingress

jyellick
2016-07-19 14:59
I like the human readability aspect of that

tuand
2016-07-19 15:46
I threw together this page https://github.com/hyperledger/fabric/wiki/Consensus-work-items-for-next-Architecture-proposal . Add your items today/tomorrow then we can use the list to prioritize our work

yajneshrai
2016-07-20 08:59
has joined #fabric-consensus-dev

yajneshrai
2016-07-20 09:10
Hello. I have a question on new consensus architecture. In a blockchain network, can we have all the peers as endorsing peers and omit submitting peers? Are these peers just the roles that can be played by any node? (I mean can a submitting peer play the role of other nodes as well?)

simon
2016-07-20 10:14
yes

yajneshrai
2016-07-20 10:33
@simon: May I know which of the above statements are true, to be more precise?

simon
2016-07-20 10:34
any machine can run multiple node types

yajneshrai
2016-07-20 10:38
okay thank you @simon ! Any comment on first question, regarding omitting all the submitting peers and keeping only endorsing peers?

simon
2016-07-20 10:40
well you need somebody to submit

yajneshrai
2016-07-20 10:41
yeah got it

yajneshrai
2016-07-20 10:43
@simon: Have you got any idea about the limit for chains that can be present in a blockchain network? (The max no of chains in a network)

simon
2016-07-20 10:46
i don't understand

simon
2016-07-20 10:47
the design is single chain

yajneshrai
2016-07-20 10:48
Is there any notion of maintaining multiple chains to keep confidentiality against other peers?

simon
2016-07-20 10:50
there is talk about it, but no clear design yet

simon
2016-07-20 10:51
do you have suggestions?

yajneshrai
2016-07-20 10:56
Yes. For a particular group of business entities (peers), create a separate chain where only they can maintain their relevant transactions. But I feel this would lead to an overhead if there are 100 (or more) separate groups of peers.

simon
2016-07-20 11:12
it is a difficult thing, yes

simon
2016-07-20 11:13
what does it mean to create a separate chain? maintain own set of consensus nodes as well?

yajneshrai
2016-07-20 12:14
Just endorsing peers will change, and consenters can remain same as they do not keep the ledgers

yajneshrai
2016-07-20 12:20
@simon: Is PBFT going to be replaced by XFT in the next consensus model?

simon
2016-07-20 12:21
we want to make consensus exchangeable

yajneshrai
2016-07-20 12:24
What does exchangeable mean?

simon
2016-07-20 12:27
ideally there will be pbft, and some sort of PoW, some crash-fault tolerant

yajneshrai
2016-07-20 12:31
But in the new proposal it is mentioned that PBFT will be completely replaced in the architecture.


tuand
2016-07-20 13:05
interesting ... https://github.com/hyperledger/fabric/issues/2262 ... I've asked Dongming to post logs

simon
2016-07-20 13:09
why are people storing megabyte files in the blockchain?

simon
2016-07-20 13:09
what is going on?

tuand
2016-07-20 13:10
i think they're testing effects of different payload sizes

simon
2016-07-20 17:18
peers are down?

simon
2016-07-20 17:19
what kind of test is this?

jyellick
2016-07-20 17:25
It is well known that large payloads require timeouts to be turned way up.


simon
2016-07-20 17:38
does this look remotely reasonable?

jyellick
2016-07-20 17:39
(looking)

jyellick
2016-07-20 17:41
So, I'm not sure I'm sold either way, but I think there's definitely a tradeoff using a map vs. explicitly enumerating the config parameters in the protobuf definition

jyellick
2016-07-20 17:42
The map means the proto stays constant and you don't have to worry about forward/backward compatibility at the proto level, but it also seems far more opaque

simon
2016-07-20 17:43
yes

simon
2016-07-20 17:43
the map is internal

simon
2016-07-20 17:43
initial config is done via json and does not do a map

simon
2016-07-20 17:44
we may want a `pbft.CheckConfig(c *BatchConfig) error`

simon
2016-07-20 17:44
also maybe something with defaults

simon
2016-07-20 17:44
my test config looked this way:


jyellick
2016-07-20 17:52
@jyellick uploaded a file: https://hyperledgerproject.slack.com/files/jyellick/F1THGPCEB/two_phase_consensus.pdf and commented: Two Phase Consensus Illustrated

simon
2016-07-20 17:52
maybe the timeouts should be strings

jyellick
2016-07-20 17:53
@kostas @simon See above

simon
2016-07-20 17:53
yea that is one way of doing that with total order broadcast

jyellick
2016-07-20 17:54
I don't see why it strictly requires total order broadcast, if you allow forking

jyellick
2016-07-20 17:55
From the broadcast to deliver step, it looks exactly like bitcoin to me. Assuming you use gossip without ordering for the network box

jyellick
2016-07-20 17:57
With total ordering, then you can guarantee exactly one correct block comes out.

simon
2016-07-20 17:57
yes

simon
2016-07-20 17:57
and you don't need to order the TX before

jyellick
2016-07-20 17:57
Well, I'd argue you are getting an implicit TX order

kostas
2016-07-20 17:58
Just came back, looking now.

simon
2016-07-20 17:58
so (4) and (5) could be using gossip

simon
2016-07-20 17:58
while (6) and (7) use total order broadcast

jyellick
2016-07-20 18:00
I think you could. Though by having ordering at 5, you can guarantee everyone builds the same block, which I think would improve the efficiency of 7

kostas
2016-07-20 18:08
So this separation between "Consenter" and "Network Orderer" in this diagram throws me off a bit. You're basically saying that on the Consenter level, you have TXs coming in, candidate blocks being created internally, which are then validated and delivered.

kostas
2016-07-20 18:13
I guess I agree, but I'm not sure what's new here, compared to the rest of today's discussions. Not being blunt, I'm trying to realize if there's a subtle difference that I'm missing.

jyellick
2016-07-20 18:13
Yes, so I've deliberately called that a 'consenter' even though it really does nothing for 'consensus' from an ordering perspective. Basically, we keep shoving 'ordering' and 'consenting' together, but I don't think people are great at articulating that they do not want this. From a bitcoin perspective, I'd argue that their 'consenters' actually don't do ordering. Ordering is done by looking at a collection of blocks, and picking the longest chain. Using PBFT, we are exploiting ordering, to deterministically create blocks, but that's a different sort of 'ordering'.

jyellick
2016-07-20 18:17
So yes, I'd say a blockchain is ordered, because blocks have numbers, and 'validity'. Under some schemes, like PBFT, it will be impossible to generate multiple valid blocks for the same block number. Under other schemes, like PoW, it may be possible to generate multiple valid blocks for the same block number, but there is an incentive not to. But fundamentally, consenters create blocks and determine whether a block is valid, not what order the blocks go in.

jyellick
2016-07-20 18:30
@kostas With regards to what's new here, is that from Binh's picture, the consenter / orderer are the same, and then there's this external 'validator' thing, which I think is wrong. I think the consenter / validator are one, and the orderer is the external thing (and could more appropriately be called 'network'). In the permissioned PBFT mode, the consenter/validator can exploit the total ordering properties of the network to decide what blocks are valid, but they do not do ordering.

liewsc
2016-07-21 01:36
has joined #fabric-consensus-dev

cca
2016-07-21 07:22
@jyellick: interesting diagram ... but can you perhaps add some API descriptions? i dont follow. would the endorsers work as described in next-consensus-architecture? if yes, which box filters those tx after "consenter" (where there is total order) violate version dependencies? why loop twice from "consenter" to "orderer" ?

kostas
2016-07-21 10:55
@cca: Jason will expand more, but in the meantime maybe it'll be a bit clearer if you think of "consenter" as a "validator". (At least it did to me.) The first pass through the consenter gives you valid transactions (i.e. those that read the right version of a key and propose a valid changeset), the second pass gives you valid blocks (so it makes sure there are no conflicting transactions in the same block, and filters out transactions that may have become stale).

simon
2016-07-21 13:21
hi guys

tuand
2016-07-21 13:21
hey simon

simon
2016-07-21 13:22
managed to catch up with everything?

tuand
2016-07-21 13:22
eh ...

tuand
2016-07-21 13:22
reading through our task list

tuand
2016-07-21 13:23
idk ... are there enough details for us to start prioritizing a bit ?

tuand
2016-07-21 13:24
also, binh is going to send out some proposals for v2 ... he wants to show the community on monday's arch group call

simon
2016-07-21 13:26
okay

tuand
2016-07-21 13:30
in the meantime, i'm going to take a quick look at #2262 ... got totally sidetracked yesterday

simon
2016-07-21 13:38
i think it isn't worth it

simon
2016-07-21 13:39
i'd appreciate it if we could decide on whether we want to go with my separate consensus approach

simon
2016-07-21 13:39
and if so, work on getting it into the main repo

simon
2016-07-21 13:39
or figure out what we need to do to get it into the repo

kostas
2016-07-21 13:39
My vote on this is known, I believe.

tuand
2016-07-21 13:42
ok ... i say we go for it and get a PR to get into master ... what's needed ?

kostas
2016-07-21 13:43
I'm all for the work in your branch, but I disagree with the main repo approach.

kostas
2016-07-21 13:44
I suggest we work in your branch until the powers that be realize that a separate consensus branch is the way to go.

simon
2016-07-21 13:44
well in the end we need to get it in

simon
2016-07-21 13:44
in one form or another

jyellick
2016-07-21 13:45
@cca I'll put together a better diagram. My key thought was that we should split block creation from ordering consensus. Basically, Bitcoin uses PoW puzzle solving as its primitive for 'this block is next', but we can use a 'total order broadcast' as our primitive for what block is next. In this way, the flow for the permissioned BFT path and the probabilistic PoW/PoET paths becomes the same, each utilizing a different primitive to determine block validity.

tuand
2016-07-21 13:46
and we have to get the community aware of and using simon's branch

simon
2016-07-21 13:46
yea that's silly

tuand
2016-07-21 13:47
which is silly ?

tuand
2016-07-21 13:53
fyi #2262 ... looks to be a bug in chaincode

kostas
2016-07-21 13:56
@jyellick: Spent some more time thinking about this yesterday. How would you say that this is different from the concept of bringing validation to the consenter level in the NCAP work, as DAH had suggested? The only difference that I see is the first pass through the consenter/validator to filter out potentially invalid transactions. But you can maybe argue that you'll take the toll on ordering invalid transactions because you'll filter them on the way out anyway. Is there another difference?

jyellick
2016-07-21 13:56
I'm not very familiar with the NCAP work / DAH's suggestion

kostas
2016-07-21 13:58
NCAP is the new consensus architecture proposal. DAH's suggestion is to have the consenters validate the blocks before they get emitted, instead of having the committers do the validation.

jyellick
2016-07-21 14:06
Ah okay, sorry, didn't catch the acronym. I think it's all pretty similar, probably more semantic than anything. The issue about modifying the NCAP and emitting only 'valid blocks' is that you lose the chain aspect, as once you've pruned the block, the hash chain from consensus is broken. But, you could certainly build a new chain, and, if you even added some sort of gossip of block signatures at that step, then I'd argue they're very similar. I still think the right first step is to go to the consenter/validator, and not the orderer, because not all consensus mechanisms require that ordering step.

jyellick
2016-07-21 14:07
@simon Assuming your comment was about my remark, certainly open to criticism, I can see some downsides, but it doesn't seem obviously broken to me, would like to hear your thoughts

simon
2016-07-21 14:08
my comment is about developing on our own branch

simon
2016-07-21 14:08
and telling the community to use that branch

simon
2016-07-21 14:09
regarding the design, can we do API design first?

simon
2016-07-21 14:09
(while thinking about how different implementations would work)

jyellick
2016-07-21 14:10
Definitely

kostas
2016-07-21 14:13
@jyellick: Gotcha. RE: "But, you could certainly build a new chain, and, if you even added some sort of gossip of block signatures at that step, then I'd argue they're very similar." Inevitably you would be collecting signatures on the new block (that's linked to the previous one) during the validation in the NCAP+DAH design, so yeah, I think the two approaches would overlap.

jyellick
2016-07-21 14:14
I think the block validation strategy could optionally depend on ordering as well.

jyellick
2016-07-21 14:15
If the rule is, you require f+1 signatures for a valid block, for instance. That's a broken rule in a 3f+1 network, because you can have two sets of f+1 signatures. If however, you require the "first set of f+1 signatures", then you're back to only one valid block.

jyellick
2016-07-21 14:15
Of course, if you require 2f+1 signatures, then it should be safe without ordering.
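(The 2f+1 rule above can be checked in one line; a sketch, with a hypothetical function name. The safety argument: any two sets of 2f+1 signers out of N = 3f+1 intersect in at least f+1 replicas, so at least one correct replica signed both — which is why 2f+1 is safe without ordering while a bare f+1 is not:)

```go
package main

import "fmt"

// safeWithoutOrdering applies the threshold rule discussed above:
// in an N = 3f+1 network, a block needs at least 2f+1 signatures
// to be valid without relying on an ordering primitive, because
// any two 2f+1-sized sets overlap in >= f+1 replicas.
func safeWithoutOrdering(sigs, f int) bool {
	return sigs >= 2*f+1
}

func main() {
	f := 1 // N = 4
	fmt.Println(safeWithoutOrdering(2, f)) // false: f+1 = 2 allows two disjoint "valid" sets
	fmt.Println(safeWithoutOrdering(3, f)) // true: 2f+1 = 3 intersects any other valid set
}
```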

simon
2016-07-21 14:16
what are you talking about?

kostas
2016-07-21 14:17
@jyellick: I agree.

jyellick
2016-07-21 14:17
@simon If the validators/consenters (not the ordering consenters) are building the blockchain themselves, by pruning bad transactions, creating a new block, and signing that block, then they must have some sort of threshold for the number of required signatures before a block becomes valid.

simon
2016-07-21 14:18
yes

simon
2016-07-21 14:19
how can we describe this independent of the consensus method

simon
2016-07-21 14:19
because PoW will be different

jyellick
2016-07-21 14:19
In some schemes, a simple threshold of "> n" might be sufficient, but, you could imagine a scheme which depends on the 'first to > n', which requires fewer signatures.

jyellick
2016-07-21 14:22
So, for the validator/consenter (not ordering consenter) [we should really come up with a formal name for this], I would think the API is pretty naturally defined. You have two api points of ingress, one accepts new transactions, the other accepts proposed blocks (or block representations, maybe hash+signature). Then, you have a single defined point of egress which is 'valid blocks'.
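The ingress/egress shape described above can be sketched as a Go interface. These are hypothetical names for illustration, not the actual Fabric interfaces; the toy implementation treats every proposed block as valid just to show the API shape:

```go
package main

import "fmt"

type Tx []byte

type Block struct {
	Number   uint64
	PrevHash []byte
	Txs      []Tx
}

// Validator has two points of ingress (new transactions, proposed blocks)
// and one point of egress (valid blocks), as described above.
type Validator interface {
	Submit(tx Tx) error         // ingress 1: unordered transactions
	Propose(b *Block) error     // ingress 2: blocks proposed by other validators
	ValidBlocks() <-chan *Block // egress: blocks accepted as valid
}

// bufValidator is a toy implementation that accepts every proposed block;
// a real one would plug in puzzle solving or signature accumulation here.
type bufValidator struct {
	pending []Tx
	out     chan *Block
}

func newBufValidator() *bufValidator {
	return &bufValidator{out: make(chan *Block, 16)}
}

func (v *bufValidator) Submit(tx Tx) error { v.pending = append(v.pending, tx); return nil }

func (v *bufValidator) Propose(b *Block) error { v.out <- b; return nil }

func (v *bufValidator) ValidBlocks() <-chan *Block { return v.out }

func main() {
	var v Validator = newBufValidator()
	_ = v.Submit(Tx("tx1"))
	_ = v.Propose(&Block{Number: 1})
	fmt.Println((<-v.ValidBlocks()).Number) // 1
}
```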

simon
2016-07-21 14:23
how do the blocks get proposed?

jyellick
2016-07-21 14:23
In the PoW mechanism, the ingress and egress are both done via gossip: trans come in, you try to solve a puzzle, and valid blocks go out. If someone else solves the puzzle first, you get that block coming in (from someone else's valid block egress), commit it, and work on the next block.

jyellick
2016-07-21 14:25
In the PBFT mechanism, the ingress of trans is unordered, we then use our atomic broadcast ordering primitive (equivalent to the puzzle solving step) to give us an ordered list of trans, and we create a valid block, sign it and send it out. Gossip would be fine here depending on the threshold scheme. Once we receive enough valid signatures, we consider that block to be valid and commit it.
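The "once we receive enough valid signatures" step above can be sketched as a small accumulator. A toy illustration with invented names, not Fabric code; real code would verify the signatures cryptographically rather than just count IDs:

```go
package main

import "fmt"

// sigCollector accumulates block signatures gossiped by validators and
// reports when the commit threshold (e.g. 2f+1) has been reached.
type sigCollector struct {
	threshold int
	sigs      map[string]bool // validator ID -> signature seen
}

func newSigCollector(threshold int) *sigCollector {
	return &sigCollector{threshold: threshold, sigs: make(map[string]bool)}
}

// Add records a signature from a validator and reports whether the block
// now has enough distinct signers to be committed. Duplicate signatures
// from the same validator count only once.
func (c *sigCollector) Add(validatorID string) bool {
	c.sigs[validatorID] = true
	return len(c.sigs) >= c.threshold
}

func main() {
	c := newSigCollector(3)  // 2f+1 with f = 1
	fmt.Println(c.Add("v0")) // false
	fmt.Println(c.Add("v0")) // false: duplicate, still one distinct signer
	fmt.Println(c.Add("v1")) // false
	fmt.Println(c.Add("v2")) // true: threshold reached, commit the block
}
```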

kostas
2016-07-21 14:25
(RE: formal names. The consenter term has been so overloaded that I honestly propose we get rid of it, at least internally. I read a doc yesterday that talked about "block adders" and thought to myself, "well, at least it gets the message across". So validators, and orderers work for me.)

jyellick
2016-07-21 14:26
(Great, I'll use that)

simon
2016-07-21 14:26
but the "send it out" would be internal to that specific pbft consensus cloud

kostas
2016-07-21 14:26
> Gossip would be fine here depending on the threshold scheme. Here you mean among the orderers?

simon
2016-07-21 14:27
so i guess my questions are:

jyellick
2016-07-21 14:27
@kostas I mean among the validators

simon
2016-07-21 14:27
1. how do we define what "right set of signatures" is

simon
2016-07-21 14:27
or rather, "cryptographic proof"

simon
2016-07-21 14:27
which is different between consensus models

simon
2016-07-21 14:28
crash fault tolerant doesn't have any

simon
2016-07-21 14:28
PoW has one crypto puzzle

simon
2016-07-21 14:28
bft has 2f+1 signatures of replicas (at that time)

simon
2016-07-21 14:29
so clearly the "proof" is specific to the consensus method chosen

jyellick
2016-07-21 14:29
1. Yes, these I've been handwaving at with "There are a bunch of ways you could do this", which I do think is true. As a proof of concept first step sort of thing, I would say we use the PBFT scheme. Given 3f+1 validators, require 2f+1 signatures before a block is valid.

kostas
2016-07-21 14:30
And on top of that, another thing that I believe Christian mentioned yesterday, and which threw me off: how do you prove that the signature you have for block X did indeed come from a valid-at-the-time validator? What if the validator wasn't authorized to be a validator back then?

kostas
2016-07-21 14:31
You probably solve this by recording on the blockchain who the validators are during an epoch.

jyellick
2016-07-21 14:31
I would say timestamps solve that somewhat nicely, if the signature is for something before the cert became valid (was issued), then it is not a valid signature.

jyellick
2016-07-21 14:31
Ah, yes, recording that on the blockchain might be cleaner

kostas
2016-07-21 14:32
I'm thinking a "from now on, and until we record another block suggesting otherwise, A-B-C-D are the approved validators" block on the chain.
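The "from now on, until we record otherwise" idea could look something like this in Go. Purely a sketch of the lookup, with hypothetical structures, not an actual Fabric design:

```go
package main

import "fmt"

// validatorSetChange is a block recorded on the chain saying "from this
// block number on, and until another such block says otherwise, these are
// the approved validators".
type validatorSetChange struct {
	fromBlock  uint64
	validators []string
}

// setAt returns the validator set in effect at block n, given the ordered
// on-chain history of set changes. A signature on block n is then checked
// against this set, rather than against whoever holds a cert today.
func setAt(history []validatorSetChange, n uint64) []string {
	var current []string
	for _, c := range history {
		if c.fromBlock > n {
			break
		}
		current = c.validators
	}
	return current
}

func main() {
	history := []validatorSetChange{
		{fromBlock: 0, validators: []string{"A", "B", "C", "D"}},
		{fromBlock: 100, validators: []string{"A", "B", "C", "E"}},
	}
	fmt.Println(setAt(history, 42))  // [A B C D]
	fmt.Println(setAt(history, 100)) // [A B C E]
}
```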

kostas
2016-07-21 14:32
Right.

jyellick
2016-07-21 14:33
This is a hard problem, though I would maintain that it's probably easier to come up with a correct validator signature scheme, than to make PBFT dynamically add orderers.

jyellick
2016-07-21 14:33
(If, for our first pass, we are interested in adding and removing validators, but not orderers)

jyellick
2016-07-21 14:36
Alright, going to hop offline for a few minutes as I make my way to the main site, back in a bit

cca
2016-07-21 15:01
@jyellick - to understand better, you want to accommodate BFT and PoW consensus in this one framework? this will be quite difficult because PoW is not final, while BFT is...

jyellick
2016-07-21 15:12
@cca So certainly, PoW can create forks, while BFT does not. So, to me the difference is, in BFT there is exactly one possible valid block for each block number, while in PoW there may be multiple. So, the idea would be that a 'validator' simply creates/validates blocks, and the actual mechanism for determining the content of the block (whether it be ordering, or puzzle solving) would plug into the validator.

jyellick
2016-07-21 15:21
I know that @sheehan has been looking to make the ledger support forks, so I'm not sure what else would make the support of both so challenging?

simon
2016-07-21 17:32
hi

simon
2016-07-21 17:33
i think people took over that room

simon
2016-07-21 17:34
do we have an API surface thing?


binhn
2016-07-21 22:41
@jyellick: could you post the new picture here

jyellick
2016-07-22 00:42
I'm not sure which new picture you mean? We plan on writing something more formal up soon that is the hybrid option 2 / 3 as described

kostas
2016-07-22 00:51
(Jason: Probably the drawing that Tuan had on the whiteboard which I believe wasn't captured because of the upcoming write-up.)

binhn
2016-07-22 01:25
I thought we split out the validator into 2 components, so

binhn
2016-07-22 01:25
I like to see those components in a diagram


simon
2016-07-22 11:59
@jyellick: does that look like what you had in mind?


simon
2016-07-22 12:00
the proto sketch is a bit unsatisfying

simon
2016-07-22 12:00
also the gossip needs more formalizing

jyellick
2016-07-22 14:08
@simon: I think that's roughly what we described. The thing that worries me in this picture is that 3/5 both seem to be some sort of network, which seems like unfortunate duplication

binhn
2016-07-22 14:34
could it be labeled like this Validator = Consenter, Validator2 = Validator ?

binhn
2016-07-22 14:36
i am not clear on how step 4 comes about — when would a batch become a block? i thought that must be part of step 6

jyellick
2016-07-22 14:37
So, option 2 smashes consensus together with the "proof of work" or "atomic broadcast" boxes. Option 3 smashes the "consensus" box with the "validator" box. And this diagram shows them both separated.

jyellick
2016-07-22 14:39
I think step 4 is appropriate where a batch becomes a block. Step 6 is doing whatever it is to validate the block to accept it

jyellick
2016-07-22 14:39
(For other validators)

jyellick
2016-07-22 14:40
In the PoW case it would be verifying the crypto puzzle. In the atomic broadcast case, it would be checking the signature and verifying enough signatures.

binhn
2016-07-22 14:42
ok, so this picture contains both — that was the source of my confusion; i’d like to draw a picture like your option 3 without pow vs ab

jyellick
2016-07-22 14:44
Ah, yes, we can do that

jyellick
2016-07-22 15:07
@jyellick uploaded a file: https://hyperledgerproject.slack.com/files/jyellick/F1U8TJ9TQ/consensus_arch.pdf and commented: Orderer only proposed arch

jyellick
2016-07-22 15:08
@simon: @binhn @kostas @jeffgarratt @tuand ^^

jyellick
2016-07-22 15:15
The picture got a little crowded with all the detail, hopefully it's still clear. Of note, steps 8 and 9 take place on both receiving consensus plugins, but that was the end of the diagram and it was getting tight, so I only listed one on each plugin to validator path.

jyellick
2016-07-22 15:23
(@cca Did not tag you because I thought you were unavailable today, see you are there, would welcome your feedback, I know you are skeptical)

cca
2016-07-22 15:23
so, let me see if i get this right: in the pdf, "consensus plugin" is only an interface, which is implemented by "atomic broadcast"; what is "consensus plugin network"? and what is "validator"?

cca
2016-07-22 15:23
thanks - not available for calls, indeed

cca
2016-07-22 15:23
for a brief here and then, yes

jyellick
2016-07-22 15:24
No, the "consensus plugin" is a piece of code which has a policy for determining block validity. So, the 'atomic broadcast service' runs somewhere, and gives us the promise that it will fairly deliver a stream of these transaction batches to everyone in the same order. This could be PBFT, could be Kafka, etc.

jyellick
2016-07-22 15:25
The "consensus plugin" takes this atomically ordered broadcast stream, and generates blocks (deterministically) from it

jyellick
2016-07-22 15:25
At this stage, every consensus plugin can generate the same block, with the same trans, and I think we are exactly at the old design

jyellick
2016-07-22 15:25
However, people really really seem to want to support consenting on the 'validated block', so we add an extra round of sharing block signatures, and accumulating enough to commit it.
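The determinism claim above ("every consensus plugin can generate the same block") can be illustrated by hash-chaining the ordered batches. A toy sketch with invented types, not the Fabric implementation: identical ordered input plus identical rules yields byte-identical chains everywhere:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

type Batch [][]byte

type Block struct {
	Number   uint64
	PrevHash [32]byte
	Txs      Batch
}

// Hash chains the block to its predecessor by hashing PrevHash plus the
// transaction payloads.
func (b *Block) Hash() [32]byte {
	h := sha256.New()
	h.Write(b.PrevHash[:])
	for _, tx := range b.Txs {
		h.Write(tx)
	}
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}

// chain deterministically turns the ordered stream of batches into a
// hash-chained block sequence. Every plugin running this over the same
// atomic broadcast output produces the same blocks.
func chain(batches []Batch) []*Block {
	var prev [32]byte
	blocks := make([]*Block, 0, len(batches))
	for i, batch := range batches {
		b := &Block{Number: uint64(i), PrevHash: prev, Txs: batch}
		blocks = append(blocks, b)
		prev = b.Hash()
	}
	return blocks
}

func main() {
	in := []Batch{{[]byte("tx1"), []byte("tx2")}, {[]byte("tx3")}}
	a, b := chain(in), chain(in)
	fmt.Println(a[1].PrevHash == b[1].PrevHash) // true: same input, same chain
}
```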

cca
2016-07-22 15:26
ok, this is what i meant earlier, but here it sort-of wraps around the actual communication with atomic broadcast

cca
2016-07-22 15:26
extra round sounds like a waste

cca
2016-07-22 15:26
- but what is the validator?

jyellick
2016-07-22 15:26
So a validator is the thing that actually understands how to apply transactions

cca
2016-07-22 15:27
so, the logic with versions and such?

jyellick
2016-07-22 15:27
Right, the consensus plugin can treat trans as opaque blobs, and call into the validator to remove conflicts and bad trans etc.

cca
2016-07-22 15:27
(i wouldn't call this validator, then, because of overload; it is only the "filterer")

cca
2016-07-22 15:27
(you need a new term)

jyellick
2016-07-22 15:28
Yes, I think it's a pretty simple object, and, I'll be curious to see if anyone actually wants to separate it from the consensus plugin in real world deployments

cca
2016-07-22 15:28
that consensus plugin would be one interface, where tx are submitted, and which outputs the "filtered" tx at the end, or?

jyellick
2016-07-22 15:29
The consensus plugin makes calls into the 'validator/filterer' to convert a batch of txs into a block

cca
2016-07-22 15:29
so behind this consensus plugin, it hides the state-agnostic atomic broadcast plus the stateful (blockchain state) filterer/validator

cca
2016-07-22 15:30
i find the pdf a bit confusing because 2 and 7/8/9 are not clear whether they are on the same node or on different ones

jyellick
2016-07-22 15:31
Yes, sorry, that PDF is crowded, maybe I should take another stab at it

cca
2016-07-22 15:31
i need a pic that shows what happens on one node, then how they link to others (which boxes talk to others)

cca
2016-07-22 15:31
2 would be better...

cca
2016-07-22 15:31
(i did not understand this here)

jyellick
2016-07-22 15:32
I will try to put together a better picture later, though may just whiteboard it this afternoon and take a picture

kostas
2016-07-22 15:32
I cannot understand the PDF anymore either FWIW

cca
2016-07-22 15:32
ah, good. i'm ok with what you just described.

jyellick
2016-07-22 15:37
Great. Yes, I think none of our discussions have really been a radical departure, we still end up with what is essentially a 'raw log', but, instead of stopping there, we form a new chain and share some signatures. Since everyone has the same raw log and the same deterministic rules, everyone will produce the same block, so that last consenting on block step is a little silly, but, it seems to make some people feel better, and it aligns nicely with the probabilistic consensus methods if we ever decide to plug one in.

simon
2016-07-22 15:42
so what is wrong with my picture?

ray
2016-07-22 15:42
has joined #fabric-consensus-dev

jyellick
2016-07-22 15:43
Binh wanted a picture without the PoW option in it

jyellick
2016-07-22 15:44
I also removed the validator to validator communication in the new one and had it all go through the plugin

jyellick
2016-07-22 15:45
There were a couple other things I noticed while trying to transcribe yours, 3b. `Broadcast` can't deliver a block, just a `[]tx`

jyellick
2016-07-22 15:46
Ah, I see, there is the difference

jyellick
2016-07-22 15:47
In 2 you call `CreateBatch`, but I would say that doesn't happen. You just deliver your received txs into the orderer service, and get back a batch of tx.

jyellick
2016-07-22 15:48
I don't think you can push blocks into the orderer, because blocks have numbers and previous block hashes, and so you get a race where a fast node can be the first to propose a block and only ever include their desired txs

jyellick
2016-07-22 15:48
If you push txs into the orderer and get batches or a stream back (which you deterministically turn into batches), you eliminate that problem.
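The boundary just described — opaque txs in, ordered batches out, with blocks formed downstream — might be sketched like this. Hypothetical names, and a single-process FIFO stand-in for the real ordering network, just to show the shape:

```go
package main

import "fmt"

type Tx []byte

type Batch []Tx

// Orderer is the proposed boundary: clients push unordered, opaque
// transactions in, and every consumer receives the same ordered stream of
// batches back. Nothing with a block number or previous-hash crosses this
// boundary, which avoids the race where a fast node proposes a block
// containing only its own txs.
type Orderer interface {
	Broadcast(tx Tx) error // ingress: unordered, opaque transactions
	Deliver() <-chan Batch // egress: the totally ordered batch stream
}

// fifoOrderer is a toy single-node stand-in: it cuts a batch every
// batchSize transactions. A real implementation would be PBFT, Kafka, etc.
type fifoOrderer struct {
	batchSize int
	pending   Batch
	out       chan Batch
}

func newFIFOOrderer(batchSize int) *fifoOrderer {
	return &fifoOrderer{batchSize: batchSize, out: make(chan Batch, 16)}
}

func (o *fifoOrderer) Broadcast(tx Tx) error {
	o.pending = append(o.pending, tx)
	if len(o.pending) >= o.batchSize {
		o.out <- o.pending
		o.pending = nil
	}
	return nil
}

func (o *fifoOrderer) Deliver() <-chan Batch { return o.out }

func main() {
	o := newFIFOOrderer(2)
	_ = o.Broadcast(Tx("tx1"))
	_ = o.Broadcast(Tx("tx2")) // batch cut at size 2
	fmt.Println(len(<-o.Deliver())) // 2
}
```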


simon
2016-07-22 15:49
how about this?

simon
2016-07-22 15:49
ignore the first slide

simon
2016-07-22 15:49
ah i see


simon
2016-07-22 15:54
what about now?

jyellick
2016-07-22 15:55
I don't see 2?

jyellick
2016-07-22 15:56
I would think 4 returns a block

simon
2016-07-22 15:56
ah yes

simon
2016-07-22 15:56
renumbering -_-

jyellick
2016-07-22 15:56
5 is a gossip net between validators, which you could do, but I thought since we needed sigs for ab and blocks for PoW, we wanted it attached to the plugin

simon
2016-07-22 15:57
well

simon
2016-07-22 15:57
that's why it is between validators

jyellick
2016-07-22 15:58
And between 5/6 I would say there's a plugin validation step as to whether the signature threshold has been reached

simon
2016-07-22 15:58
because it is a common function

jyellick
2016-07-22 15:58
I agree, it could go either place. My issue with your very first picture was that it was both.


simon
2016-07-22 16:11
how about now?

simon
2016-07-22 16:12
we can also move (6) gossip down

simon
2016-07-22 16:12
and make it consensus implementation specific

simon
2016-07-22 16:12
clearly (6) can be implemented by consensus anyways

jyellick
2016-07-22 16:12
2. CreateBlock should be 5.

simon
2016-07-22 16:13
sorry, copy+paste to get a nice ODC compliant slide

jyellick
2016-07-22 16:13
And there's an extra 8, but otherwise I think it's looking pretty good

simon
2016-07-22 16:16
aaand libreoffice threw away a slide

jyellick
2016-07-22 16:20
I was wondering why it was pptx.... thought you were on Linux

jyellick
2016-07-22 16:20
(why I kept posting pdfs instead of odps)

simon
2016-07-22 16:20
i figured windows people could use pptx better


simon
2016-07-22 16:21
you see that validateblock is gone

simon
2016-07-22 16:21
okay, i'll head out for a bit

jyellick
2016-07-22 16:26
Yeah, that looks good to me now, thanks Simon

simon
2016-07-22 17:55
cool

mtakemiya
2016-07-23 11:55
Is PBFT Sieve a part of the hyperledger fabric distribution, or was it removed in the transition from openblockchain to fabric?

tuand
2016-07-23 17:31
sieve was part of fabric but we decided to concentrate on batch pbft in the current codebase. We're also working on a new architecture to decouple consensus from the other components and allow for plugging in other consensus protocols. https://github.com/hyperledger/fabric/wiki/Next-Consensus-Architecture-Proposal

mtakemiya
2016-07-24 23:46
What is the difference between Batch PBFT and Sieve? Did the fabric implementation of Sieve only work with one operation at a time?

tuand
2016-07-25 01:30
in batch pbft, fabric will batch transactions (batch size set in hyperledger/fabric/consensus/pbft/core.yaml) before sending the batch through consensus. Sieve works on one transaction at a time and adds an extra step to verify the result of the transaction.

mtakemiya
2016-07-25 02:47
Okay, that makes sense now

virajkamat
2016-07-25 04:18
has joined #fabric-consensus-dev

huxd
2016-07-25 10:47
has joined #fabric-consensus-dev

simon
2016-07-25 12:43
@jyellick, @kostas: i think we overlooked one aspect: the `ValidateBlock(block) block` function looks like it is stateless, but it can't be

kostas
2016-07-25 13:05
Right. I made that point (w/o being aware that statelessness was implied previously) during the meeting on Friday.

kostas
2016-07-25 13:06
Since you couldn't make Friday's meeting: We concluded that the validation-via-gossip function is the one part we may want to keep from last week's discussions.

kostas
2016-07-25 13:06
So it's NCAP with this optional module built either on the consenter or the committer level. Binh will be presenting these options to the community for further discussion.

simon
2016-07-25 13:07
what is NCAP?

kostas
2016-07-25 13:08
It is the acronym Marko uses for the new consensus architecture proposed in the Wiki.

simon
2016-07-25 13:08
oh

simon
2016-07-25 13:08
so how does the architecture look now?

simon
2016-07-25 13:08
different from my pptx?

kostas
2016-07-25 13:12
Will have to load your PPTX when I'm on the laptop and will get back to you. (About to leave for work now.)

simon
2016-07-25 13:12
ok

simon
2016-07-25 13:12
was that meeting some time late?

kostas
2016-07-25 13:12
@binhn Can we post the slides with the suggested designs here?

binhn
2016-07-25 13:13
i simplified the chart as follows (more on the architecture side) to talk with the community


kostas
2016-07-25 13:13
(@simon: 1pm-5pm EDT IIRC, we had the dial-in on for you.)

binhn
2016-07-25 13:13
and the tx flow becomes


simon
2016-07-25 13:15
so why did we throw away what we had been discussing the days before?

simon
2016-07-25 13:16
how does this work with PoW?

kostas
2016-07-25 13:23
This question I never quite got during that discussion last week. With the current NCAP design (w/ zero modifications), what prevents the consenters from picking up the next block to go via PoW?

kostas
2016-07-25 13:25
Or let's consider PoET. The consenters receive transactions from the submitting peers, the PoET wizardry kicks in, one of them wakes up first and pushes the batch it has collected to the consenters.

jyellick
2016-07-25 13:50
@simon The biggest piece I think, was whether to deal with validation in the 'c-service'. It was eventually concluded that that's a non-starter, so the validation split didn't make sense. For those who would like an extra layer of consensus on the block output, we can do that between peers via gossiped signatures or whatnot.

jyellick
2016-07-25 13:51
Per @kostas it also simplifies things drastically, if we simply have PoW or PoET output batches and _not_ blocks. Then it behaves exactly like all the other ordering services, and there's no need to modify the gossiped signatures etc.

jyellick
2016-07-25 13:54
Certainly for something like bitcoin, it makes sense to have the miners operate on only 'valid transactions', because of the economic incentive, but, since we do not require a currency transaction here, it makes no sense here. I'd also argue it's shockingly like the bitcoin model, because for those applications which push a hash onto the chain, absolutely 'invalid hashes' may make it onto the chain.

simon
2016-07-25 15:36
so what's the plan - what should i work on now?

simon
2016-07-25 15:36
i somehow wanted to implement the proposed API, but it is unclear whether this is a waste of time or not

jyellick
2016-07-25 15:37
I think regardless of what we end up doing, the separated 'ordering' interface is something that's in every picture?

jyellick
2016-07-25 15:37
(the thing that brings in txs, and outputs batches)

jyellick
2016-07-25 15:37
Have you already done that?

simon
2016-07-25 15:38
yes


jeffgarratt
2016-07-25 15:57
@simon is this proto in a branch?

simon
2016-07-25 15:57
no

simon
2016-07-25 15:58
just hacked it so that we can talk about it

jeffgarratt
2016-07-25 15:58
no worries, will load manually

simon
2016-07-25 15:58
i didn't check whether it actually compiles

jeffgarratt
2016-07-25 15:58
understood

jeffgarratt
2016-07-25 15:58
will get the gist

jeffgarratt
2016-07-25 16:04
@simon: @jyellick Any way we could discuss this in a hangout ?

jeffgarratt
2016-07-25 16:05
if we feel prepared to

jyellick
2016-07-25 16:05
I'm sitting on a mostly vacant line waiting for the weekly consensus/security meeting to start

jeffgarratt
2016-07-25 16:05
ohhhh

jeffgarratt
2016-07-25 16:05
:slightly_smiling_face:

simon
2016-07-25 16:05
jyellick: i don't have an invite for that?

jyellick
2016-07-25 16:05
But would love to after

jyellick
2016-07-25 16:05
You are listed as optional for it in mine (@simon)

simon
2016-07-25 16:05
aha!

jyellick
2016-07-25 16:06
But it's just @tuand and I waiting, thinking it is not happening

jeffgarratt
2016-07-25 16:06
if you shoot to me, I can join if you wish to discuss it there


cbf
2016-07-25 16:43
someone please correct me if I am mistaken but didn’t we deprecate pbft-classic?

simon
2016-07-25 17:05
we did

simon
2016-07-25 17:15
issue with this model: how does the (atomic broadcast) consenter know that its validator is caught up with the blockchain?

binhn
2016-07-25 17:50
ok, here is the very early draft of what we talked about this morning. Obviously I ll continue to work on this and with community tomorrow, but appreciate any comments


binhn
2016-07-25 18:30
say, after block validation, the peer decides to discard the block; i think the peer wouldn't have to communicate that with the consenters since they are stateless — agreed?

jyellick
2016-07-25 18:34
I think we are re-evaluating this, but not done. The goal is for a stateless consenter, but I'm not entirely sure if it's possible. The problem is that if the consenter does not know the state of the blockchain, it cannot make some decisions. If the network is in lockstep, things are fine, but if the consenter is new, or was offline for some period of time, it will likely need some 'where is the chain now' data to bootstrap.

jyellick
2016-07-25 19:50
@simon @binhn @kostas @tuand @jeffgarratt @sheehan What is the downside to making the validator/consenter in the picture be a single entity? They seem to be necessarily tightly coupled to me. We would have ordering of transactions broken out separately via the ordering service, we would have execution of chaincode separately via the endorsers, The middle piece 'peer/validator/consenter' would then be all the logic for manipulating and maintaining the chain. There's no need for information to leak to the ordering service, so put the 'as a service' boundary there. By trying to split them, it seems like we're introducing a lot of headaches and unwanted/unneeded complexity. We could preserve a 'consensus plugin' API boundary as we do today, which abstracts the ledger details and network implementations away, but the more I look, actually keeping the consenter and validator in separate processes seems to have more downsides than up.

binhn
2016-07-25 19:57
i am ok merging them together - as long as we keep the orderers separate — btw, the terminology becomes so confusing at this point: the wiki doc calls the consenter an orderer, and now we intro validator/consenter, which to me could all be functions of a committer

jyellick
2016-07-25 20:05
Yes, I think the orderer boundary is a clean one we should definitely maintain. The ability to plug in other ordering mechanisms like Kafka is a must have. And I'd maintain it's a semantic distinction; that "Consensus as a service" isn't actually what people want, it's "Ordering as a service", where the actual consensus with knowledge of content is done locally. I'm worried that trying to gRPC the consenter/validator boundary is going to produce a large and brittle interface, with 1-1 network topology, and I'm just not sure who really wants to plug in here? For PoW/PoET, this can be plugged in at the 'ordering as a service' layer.

kostas
2016-07-25 20:26
As I pointed out privately, my main concern is keeping the orderers (the bottom layer in Simon's branch) separate and it seems like we're good here; this is being addressed both in NCAP and in the detailed designs we're considering now. I don't see a need to keep the validation and consenters (as shown below) as separate processes.

kostas
2016-07-25 20:27

kostas
2016-07-25 20:27
Could we maybe argue that this separation is necessary if we want to experiment with a design where consensus is entirely optional?

jyellick
2016-07-25 20:28
What do you mean "where consensus is entirely optional"?

kostas
2016-07-25 20:28
(If you think that's out of the question, I'd like to point out that this is what NCAP proposed up until now. And still does actually.)

kostas
2016-07-25 20:29
As per the NCAP document: no gossiping on whether you and I have the exact same chain, because we trust in the deterministic, local filtering process.

jyellick
2016-07-25 20:30
I guess I am not suggesting that we would not still have multiple 'consensus plugins', including 'noops 2.0' which would simply write the unvalidated block to the chain (or maybe optionally the validated block) [though this I would argue is more of a validation policy than a different plugin]. But I don't see why this makes a different process necessary.

kostas
2016-07-25 20:34
(You deleted the sentence I was about to object to :grin:)

jyellick
2016-07-25 20:35
Yes, sorry, hit send a little quickly.

binhn
2016-07-25 20:35
if ordering is optional, then we would just not call it — noops, and no gossiping on blocks either

binhn
2016-07-25 20:36
so i don’t have a reason to keep consenter and validator separate either

simon
2016-07-26 08:12
i'm fine with merging validator and consensus into one process

simon
2016-07-26 08:13
in that case we're talking about defining the internal API

simon
2016-07-26 08:13
it needs to be a concise API so that we can at least have two implementations for atomic broadcast and for PoW

mgk
2016-07-26 09:54
has joined #fabric-consensus-dev

simon
2016-07-26 11:54
hi

simon
2016-07-26 11:57
what's our answer to lost transactions in v0.5?

simon
2016-07-26 11:58
can happen, wontfix?

tuand
2016-07-26 13:02
is there an issue opened for that ? I would agree that we just document for 0.5 and see what we should do with the new arch

simon
2016-07-26 13:14
there are several

tuand
2016-07-26 13:15
aha, i'm supposed to talk to scottz today about some issues ... must be these

simon
2016-07-26 13:16
yes

simon
2016-07-26 13:16
if you're okay with closing those then i will

simon
2016-07-26 13:16
because now i can :slightly_smiling_face:

tuand
2016-07-26 13:16
hahahaha :slightly_smiling_face: let me talk to him first. I'll get him to close

jeffgarratt
2016-07-26 13:17
@jyellick: @simon when Jason comes back online I will ask if he has time to continue discussion

tuand
2016-07-26 13:17
btw, can you label issues now ?

simon
2016-07-26 13:17
yes

simon
2016-07-26 13:17
jeffgarratt: great

simon
2016-07-26 13:19
vita: ^^

simon
2016-07-26 13:19
@vita: ^

jyellick
2016-07-26 13:20
@jeffgarratt: @simon I'm online

jeffgarratt
2016-07-26 13:23
k, I need about 10 mins.

simon
2016-07-26 13:28
can we do a telco? not sure if i can do hangouts when transiting

simon
2016-07-26 13:28
tho

jyellick
2016-07-26 13:28
Sure

simon
2016-07-26 13:28
i can try

jyellick
2016-07-26 13:28
I'll send you guys the PC out of band

simon
2016-07-26 13:29
ah no, i think hangout will work

simon
2016-07-26 13:29
that's good

mandler
2016-07-26 14:11
I'll be happy to join in Vita's place (she can't join at the moment). Please let me know when and how to connect

binhn
2016-07-27 00:32
@binhn uploaded a file: https://hyperledgerproject.slack.com/files/binhn/F1VDNHG4E/fabricnext-community.pptx and commented: Deck I discussed with community today

simon
2016-07-27 09:10
so we're going to use jira for issue tracking?

simon
2016-07-27 09:10
in my experience that's the slowest interface ever

tuand
2016-07-27 12:48
ok, so can someone summarize what jason/simon/jeff discussed yesterday ? or is there code to look at ?

simon
2016-07-27 12:48
we decided to start writing code

simon
2016-07-27 12:49
prototyping the new architecture

tuand
2016-07-27 12:49
ah, ok . your branch ?

simon
2016-07-27 12:51
sort of, but now we have that whole gerrit thing going

simon
2016-07-27 12:51
so i don't know how that will work

tuand
2016-07-27 12:52
ya , i've got to set that up today

simon
2016-07-27 12:52
well, there is no code in gerrit yet

simon
2016-07-27 12:52
my hope is that we will start from 0

simon
2016-07-27 12:52
but probably not

kostas
2016-07-27 12:57
We also drew out yet another diagram with APIs to make the flow a bit clearer, and roughly sketched out what the most basic Behave test will look like for the new prototype.

tuand
2016-07-27 12:59
post the diagram here ?

kostas
2016-07-27 13:01
Jeff has a picture of it handy, I don't.

tuand
2016-07-27 13:02
ok mr Garratt @jeffgarratt :slightly_smiling_face:

simon
2016-07-27 14:18
@jeffgarratt, @kostas, @jyellick: what happened yesterday?

simon
2016-07-27 14:18
i'm in the code, trying to extract changesets from transactions

kostas
2016-07-27 14:34
@simon: We took another stab at drawing out the APIs. Not a rewrite of your effort per se, but mostly an exercise to make sure we (Jeff, Jason, and I) are on the same page. (And as such, I think the drawing would make little sense when shared outside the group. Jeff has it though.)

kostas
2016-07-27 14:34
We also worked on the minimum network configuration needed (in terms of endorsing peers, committing peers) and the simplest behave scenario for the new architecture. Jeff will chime in with an update on this.

kostas
2016-07-27 14:34
It may also be wise for us to sync up and do a rough task assignment of sorts.

simon
2016-07-27 14:36
okay, i'll just hack code

simon
2016-07-27 14:36
can't stand not hacking code

jeffgarratt
2016-07-27 14:37
gonna grab a coffee and back in 10 mins

jeffgarratt
2016-07-27 14:37
will shoot message and see if we can't sync

jeffgarratt
2016-07-27 14:37
@simon: @kostas @jyellick @tuand ^^

mandler
2016-07-27 15:47
We (HRL) came up with a proposal for a gossip network for the new architecture, to accommodate different communication / dissemination needs among the different entities. I'll post a short document with the main ideas. I'd be happy to get your feedback on that, and discuss further.



nishi
2016-07-27 21:41
has joined #fabric-consensus-dev

cca
2016-07-27 22:00
@simon, @kostas: do you have the result of sketching the APIs shared here?


kostas
2016-07-28 03:01
@cca: (^^ as requested)

yacovm
2016-07-28 10:24
has joined #fabric-consensus-dev

cca
2016-07-29 04:43
:slightly_smiling_face:

mihaig
2016-07-29 10:54
has joined #fabric-consensus-dev

sanchezl
2016-07-29 14:25
has joined #fabric-consensus-dev

nits7sid
2016-08-03 15:33
How is XFT different from PBFT??

jyellick
2016-08-03 16:13
@nits7sid: I am not an XFT expert, but my understanding is that XFT makes different assumptions about the behavior of the network, in particular that byzantine nodes and byzantine network behavior do not happen at the same time. @vukolic or @cca might have a better answer

kostas
2016-08-03 16:17
Section 3 in the XFT paper should get you covered. It's short and easy to parse: http://arxiv.org/pdf/1502.05831v2.pdf

louisw
2016-08-03 16:55
has joined #fabric-consensus-dev

vukolic
2016-08-03 20:41
As @kostas mentioned Section 3 of the paper is a good start

vukolic
2016-08-03 20:42
there is a neat similarity between XFT and PBFT in that they are both OSDI papers :wink:

nits7sid
2016-08-04 02:40
@jyellick: thanks...

nits7sid
2016-08-04 02:44
the current PBFT is not tolerating crash faults??

jyellick
2016-08-04 02:47
@nits7sid: The current PBFT implementation should be crash fault tolerant, do you have a scenario which is failing?

nits7sid
2016-08-04 02:56
ohh.. I haven't noticed any yet. I am just trying to understand how XFT will benefit and what impact it will have on the network as compared to the current PBFT

nits7sid
2016-08-04 03:21
@jyellick: could you please explain system anarchy to me?

shubhamvrkr
2016-08-04 03:39
Hi! I had a doubt about the new consensus architecture. If I have a network with 2 submitting peers and 2 endorsing peers, and SP1 deploys the chaincode with EP1 and EP2 in the ccEndorserSet, then after the consensus service the proposal says that the consenters will commit to the peers. So my question is: to which peers will they send the commit?

simon
2016-08-04 09:35
shubhamvrkr: all nodes that maintain a ledger will receive the block and commit

shubhamvrkr
2016-08-04 09:53
ohh okay

shubhamvrkr
2016-08-04 09:53
thanks:)

simon
2016-08-04 10:11
shubhamvrkr: what problem did you have in mind?

shubhamvrkr
2016-08-04 10:28
Actually I want only those peers who are executing the transactions to receive the blocks. The rest of the peers should not receive the blocks

simon
2016-08-04 10:32
why?

simon
2016-08-04 10:32
then it is no longer a blockchain

simon
2016-08-04 10:32
but multiple blockchains

shubhamvrkr
2016-08-04 11:06
yes..something similar

simon
2016-08-04 11:07
why do you want this arrangement?

shubhamvrkr
2016-08-05 10:19
if a party is not interested in some transactions that don't involve him, then why should he have those transactions? Isn't his storage simply being wasted with irrelevant transactions?

simon
2016-08-05 11:35
that's the whole point of blockchain

simon
2016-08-05 11:36
that everybody sees the same data

simon
2016-08-05 13:26
should we work on a variation of our pbft, without watermarks, and simplified operation?

jyellick
2016-08-05 14:03
I think ultimately it would be valuable, though this goes back to my belief that we should commit to supporting transports without ordering (and then use one), or commit to using ordering, and simplify the code.

jyellick
2016-08-05 14:04
The assumptions are sufficiently different, I'm not convinced they can be easily maintained in parallel (maybe you disagree?)

nick
2016-08-05 22:05
hi everyone. is there some kind of documentation for the new endorser committer model for consensus?

tuand
2016-08-05 22:43
@nick current documentation is what's on the wiki page describing the new architecture

nick
2016-08-05 23:22
@tuand. thanks. got it.

alexho
2016-08-07 15:49
has joined #fabric-consensus-dev

simon
2016-08-08 08:18
@jyellick: i realized that we're building an outside "byzantine" consensus if we're doing the committer-signature gossip thing

shubhamvrkr
2016-08-08 09:14
i am using PBFT with N=4 and f=1. What will happen if one node is Byzantine and another node has crashed? will the system still remain stable?

simon
2016-08-08 09:15
it will stop processing transactions

simon
2016-08-08 09:15
once the crashed node comes back up, it will continue

shubhamvrkr
2016-08-08 09:17
i was going through the XFT paper and it said that it can handle both crash and non-crash faults but with limited resources, i.e. n=2f+1. How is XFT achieving this?

simon
2016-08-08 09:18
i believe it works by assuming that network faults and byzantine nodes do not happen at the same time

shubhamvrkr
2016-08-08 09:19
oh and what about the PBFT? then why pbft requires 3f+1 nodes?

simon
2016-08-08 09:24
because that's what you need in the face of concurrent network outage and byzantine nodes

simon
2016-08-08 09:24
do you have any specific questions about pbft?

shubhamvrkr
2016-08-08 09:31
so in pbft, when a crash and a non-crash fault happen at the same time (in case f=1 and n=4), the system will be consistent (i.e. the state of the remaining 2 correct nodes is guaranteed to be the same), but the system will have to wait till the crashed node comes up to continue with the consensus?

shubhamvrkr
2016-08-08 09:32
and in the case of XFT the consistency might not be guaranteed (in case f=1 and n=3)?

shubhamvrkr
2016-08-08 09:32
correct me if I'm wrong

simon
2016-08-08 09:40
yes for pbft

simon
2016-08-08 09:40
i can't answer for XFT

shubhamvrkr
2016-08-08 09:41
ohh okay

shubhamvrkr
2016-08-08 09:44
on what basis is that independent-events assumption made? Any proof of this?

simon
2016-08-08 09:45
it's assumptions

simon
2016-08-08 09:46
did you read the XFT paper?

shubhamvrkr
2016-08-08 09:48
yes i read it... it says that the faults occur independently, which is very reasonable in practice.. so in a system using XFT, if both occur at the same time, then the whole system will crash, right?

simon
2016-08-08 09:54
@vukolic: ^^

vukolic
2016-08-08 09:56
@shubhamvrkr: in XFT, if the count of Byzantine replicas is >0 then one counts BOTH Byzantine and CORRECT but partitioned replicas towards the threshold t

vukolic
2016-08-08 09:57
in this case, below the threshold t (inclusive) - both availability and consistency are preserved

vukolic
2016-08-08 09:57
if the count goes above t there are no guarantees

vukolic
2016-08-08 09:57
additionally, if you DO NOT have Byzantine replicas then you get the same guarantees as Paxos/Raft/ZAB

vukolic
2016-08-08 09:57
hope this helps

shubhamvrkr
2016-08-08 09:58
ohh okay

shubhamvrkr
2016-08-08 09:58
got it

shubhamvrkr
2016-08-08 09:58
@vukolic:@simon:thanks

garisingh
2016-08-08 10:09
@simon - where's the latest prototype code for the new architecture? still in your fabric fork?

simon
2016-08-08 10:11
hi gari!

simon
2016-08-08 10:11
yes, there is a hacked version that implements the new flow

simon
2016-08-08 10:12
transaction -> rest -> devops -> peer -> chaincode simulation & collect changeset; changeset -> engine -> atomic broadcast network -> engine -> ledger applies changeset

garisingh
2016-08-08 10:14
cool - which branch should I look at? I saw 2 with similar names

simon
2016-08-08 10:16
separate-consensus

simon
2016-08-08 10:17
maybe just browse the commits

garisingh
2016-08-08 10:21
cool - I was actually looking at the right stuff then. just wanted to make sure

simon
2016-08-08 10:21
right

simon
2016-08-08 10:21
what's not implemented is endorser flow

simon
2016-08-08 10:21
nor signature/changeset validation

nits7sid
2016-08-08 10:30
@vukolic: are the passive replicas in XFT part of the synchronous group?

shubhamvrkr
2016-08-08 10:50
@vukolic: why is the prepare phase required in the common case when t>=2?

vukolic
2016-08-08 10:52
@nits7sid: no. @shubhamvrkr: the answer would be very long :slightly_smiling_face:

vukolic
2016-08-08 10:54
let's just say it is necessary always, but optimizations are possible for t=1

shubhamvrkr
2016-08-08 10:54
ohh okay

shubhamvrkr
2016-08-08 10:55
is it because of some blocking of resources issue ?like in 2 phase commit protocol?

vukolic
2016-08-08 10:57
it is essentially because, when t=1, the first message from the primary acts as a PREPARE msg as well, so no need to send that one in the 2nd phase

shubhamvrkr
2016-08-08 10:58
oh okay

shubhamvrkr
2016-08-08 11:02
and when t=1, does the primary send his reply and the follower's reply to the client, or just the follower's reply?

shubhamvrkr
2016-08-08 11:02
i.e. m1 to c according to the paper

vukolic
2016-08-08 11:06
yes (the primary sends his reply and the follower's reply to the client)

simon
2016-08-08 11:06
t=1 is boring

vukolic
2016-08-08 11:07
it is - but it shows a drastic difference wrt. PBFT with t=1

vukolic
2016-08-08 11:07
the fact it is boring is beautiful :slightly_smiling_face:

shubhamvrkr
2016-08-08 11:07
:smile: that is true @vukolic

simon
2016-08-08 11:08
i mean, there is an optimization for t=1, but you also only have very few nodes

vukolic
2016-08-08 11:09
for more nodes - what needs to be improved in that protocol is rotation across synchronous groups

vukolic
2016-08-08 11:09
combinatorial rotation is ok for t=1 and t=2 but for large values of n and t won't do it

vukolic
2016-08-08 11:10
the protocol was invented 4 years ago w/o blockchain in mind at that moment

vukolic
2016-08-08 11:12
when one had to convince people of the very need of Byz fault tolerant protocols - let alone the need for those that scale well

shubhamvrkr
2016-08-08 12:32
@vukolic: how will the view change take place when the primary turns byzantine and t=1?

vukolic
2016-08-08 12:34
@shubhamvrkr: let's please not abuse this channel too much - I suggest you carefully read the whole paper (http://arxiv.org/pdf/1502.05831v2.pdf) esp Sec 4.2 and contact me on private or email in case of particular questions

jyellick
2016-08-08 13:59
@simon To your comment much earlier: > i realized that we're building an outside "byzantine" consensus if we're doing the committer-signature gossip thing I completely agree. We are leveraging atomic broadcast to build this external byzantine-tolerant consensus. We may choose to re-use the signatures from PBFT nodes if we wish to perform no validation, but ultimately, consensus occurs at the peer, while atomic broadcast ordering (which requires its own internal consensus) is a separate piece.

simon
2016-08-08 14:00
right, it will only be byzantine tolerant if we know N and configure the required signatures to be appropriate

jyellick
2016-08-08 14:00
Correct

simon
2016-08-08 14:00
right

vita
2016-08-08 14:25
@jyellick: @simon To your discussion about the "committer-signature gossip thing": can you please give more details? we had some discussions last week with Marko @vukolic at Haifa about signatures from consenters to committers

simon
2016-08-08 14:26
i don't quite know, i guess one design idea is to have committers sign blocks and exchange signatures

vita
2016-08-08 14:28
We were more focused on propagation from consenters to committers and how to sign these messages

simon
2016-08-08 14:29
okay

jyellick
2016-08-08 14:31
@vita Yes, I believe the idea was to have a 'block validation system chaincode', which would define a policy for bringing a block into the chain. Something like "It has k signatures from the following n public keys"

jyellick
2016-08-08 14:40
The nice/elegant thing about defining block acceptance through such mechanism, is that you can update it simply by adding a new block with a new policy, so if you want to bring other public keys in, or blacklist others, it's relatively simple. Similarly, the validation policy is tied to the chain, so that as the policy changes, it's possible to verify that older blocks still matched whatever the policy was at their time of inclusion.

nick
2016-08-08 17:17
@simon I am trying to look at the code for the new flow. i see that you have mentioned its under the name separate-consensus. where is this exactly?

nick
2016-08-08 17:21
I should have been a little bit more clear. When I say new flow, I mean the consensus flow based on the new consensus architecture



nick
2016-08-08 17:25
@kostas.. Thank you!

troyronda
2016-08-12 11:09
has joined #fabric-consensus-dev

jyellick
2016-08-12 19:57
@simon https://github.com/jyellick/fabric/commit/babe257fb6fe2f571559aa8de51e0b244da3579f Still work pending, in particular tests, but there is a first pass implementation of 'solo' using the new offset/history aware proto api for the orderer. I ended up excluding the security stuff for now as it wasn't necessary for this first pass. If you wish to give it a test drive, you can build an 'orderer' 'reader' and 'submitter' binary by simply typing `go build` in their respective directories. The reader just dumps blocks out to stdout as they are created, and 'submitter' submits a single new transaction blob to the orderer. It should be safe to run multiple readers and submitters, though obviously only one solo orderer. All of the configuration is simply hard coded in go for the moment and something else can be plugged in later.

simon
2016-08-15 08:30
okay

simon
2016-08-15 08:30
i'll have a look

lefkok
2016-08-15 15:26
has joined #fabric-consensus-dev

jyellick
2016-08-16 14:25
@simon @jeffgarratt @kostas Discuss orderer bdd here?

jeffgarratt
2016-08-16 14:25
got it

jeffgarratt
2016-08-16 14:25
yes

simon
2016-08-16 14:34
yes

simon
2016-08-16 14:34
when we say bdd, does it always mean behave with docker containers, etc.

simon
2016-08-16 14:35
or can it also mean writing a go test that captures the behavior?

jeffgarratt
2016-08-16 14:39
either

kostas
2016-08-16 14:39
@jyellick: I'm here and good to go.

jeffgarratt
2016-08-16 14:39
I was considering porting the python BDD to ginkgo or similar

jeffgarratt
2016-08-16 14:40
want to hangout?

kostas
2016-08-16 14:40
@jeffgarratt: I'm a fan of http://goconvey.co/

jeffgarratt
2016-08-16 14:40
saw it :slightly_smiling_face:

jeffgarratt
2016-08-16 14:40
that was the 'similar'

kostas
2016-08-16 14:41
ACK

jyellick
2016-08-16 14:41
I am hugely in favor of _not_ doing the dockerized stuff

jyellick
2016-08-16 14:41
Tests that don't run quickly are significantly less useful

kostas
2016-08-16 14:41
I agree, I am working on a way to drop the Docker dependency for Kafka as well.

jeffgarratt
2016-08-16 14:42
I would advise caution

jeffgarratt
2016-08-16 14:42
can we hangout to discuss?

jeffgarratt
2016-08-16 14:43
my concern is that these discussions become useless in time as they scroll out of our history

jeffgarratt
2016-08-16 14:43
thus making hangouts more efficient

jyellick
2016-08-16 14:43
I am fine with hangouts, not sure if that works for everyone else

jeffgarratt
2016-08-16 14:44
or phone call, whatever is best for all

kostas
2016-08-16 14:44
The less we use this Slack (which doesn't record messages) the better off we are. I barely take the time to write anything here any more.

kostas
2016-08-16 14:44
Let's do Hangouts.

jeffgarratt
2016-08-16 14:44
starting...

2016-08-16 14:44
@jeffgarratt has started a Google+ Hangout for this channel. https://hangouts.google.com/hangouts/_/45uywkzfhbgunbyclfbvncbeyme.

simon
2016-08-16 14:45
how does the hangouts preserve the conversation?

simon
2016-08-16 14:45
does it keep the videos?

jyellick
2016-08-16 16:13
I think the argument is that slack is as ephemeral as hangouts, so why not use hangouts as it's faster

jeffgarratt
2016-08-16 16:13
exactly

jeffgarratt
2016-08-16 16:21

jeffgarratt
2016-08-16 16:22
@simon: @kostas @jyellick @tuand @binhn ^^^ orderer.feature above

jyellick
2016-08-16 17:02
https://jira.hyperledger.org/browse/FAB-43 Posted the proto and the above BDD to this issue

jyellick
2016-08-16 17:03
Don't really know how to work Jira yet.... so, not sure if you guys get notified because of being tagged or not

tuand
2016-08-16 17:03
i got the notification

jyellick
2016-08-16 17:04
Oh okay, do you know how to reference other issues in a comment? On github it was as easy as "Issue #<xxxx>" but I'm not seeing anything obviously similar

tuand
2016-08-16 17:06
i haven't tried that

garisingh
2016-08-16 17:07
just use #<xxxx>

jyellick
2016-08-16 17:09
Interesting, finally figured out where I can preview and I see the highlighting, I guess it does not show up in the editor. With #<xxxx> I get the '#' in front, it seems to link correctly just to '<xxxx>'

jyellick
2016-08-16 17:10
I guess my confusion came from expecting a list of potential choices to autocomplete like github

silliman
2016-08-16 18:07
has joined #fabric-consensus-dev

somashekar
2016-08-17 05:38
has joined #fabric-consensus-dev

simon
2016-08-17 09:45
you can just say FAB-123

simon
2016-08-17 09:45
and it will link

jyellick
2016-08-17 13:52
@jeffgarratt: @simon @kostas Just uploaded a new ab.proto with some name changes, see https://jira.hyperledger.org/browse/FAB-43 for details, would appreciate feedback

jeffgarratt
2016-08-17 13:52
k, will look shortly

jeffgarratt
2016-08-17 13:52
and regen

jyellick
2016-08-17 13:53
You'll also notice per discussion with Kostas yesterday the semantics of the old 'update' now 'deliver_update' has changed to be the protobuf 'oneof' which is likely the more disruptive change, the others should be search and replace

kostas
2016-08-17 13:55
Just had a look, these are more expressive names indeed. :+1:

simon
2016-08-17 13:57
hi

simon
2016-08-17 14:01
will join the call in a sec

gemsiva
2016-08-17 15:16
has joined #fabric-consensus-dev

ramesh
2016-08-17 21:05
has joined #fabric-consensus-dev

ittaia
2016-08-18 14:09
has joined #fabric-consensus-dev

sri_narayanan
2016-08-18 22:36
has joined #fabric-consensus-dev

simon
2016-08-19 11:44
i'm trying to figure out the right behavior of replicas that announce different views in view change

simon
2016-08-19 11:46
how does pbft handle replicas diverging their idea of what view should be next?

simon
2016-08-19 11:47
@vukolic: any suggestions?

simon
2016-08-19 11:47
is this just handled by the exponential increase in view change timeouts?

vukolic
2016-08-19 11:50
Exponential increase in timeouts is typically used

vukolic
2016-08-19 11:50
Id say this is to be used only until a successful view change

vukolic
2016-08-19 11:50
When the timeouts should be reset

simon
2016-08-19 12:14
is that enough to guarantee that they will all "find" each other again, with a view-change for the same view?

vukolic
2016-08-19 13:11
it should be if the system becomes synchronous so that initial timeout makes sense.

vukolic
2016-08-19 13:11
What you may also do is amplify view changes

vukolic
2016-08-19 13:12
e.g., whenever I hear about f+1 view change messages for view n and higher - I could send a view change message for that view

vukolic
2016-08-19 13:12
if I am not yet there

kostas
2016-08-19 13:13
Just a note that we do this already: we send a view-change for the smallest view in that set even if our timer hasn't expired

vukolic
2016-08-19 13:13
very good

vukolic
2016-08-19 13:13
Simon you may want to keep that in the re-write

vukolic
2016-08-19 13:14
for exponential increase - view change timeout could be reset to "normal" on checkpoints

vukolic
2016-08-19 13:14
so replicas do this "at the same time"

simon
2016-08-19 13:16
breaks my brain

harrijk
2016-08-19 13:29
has joined #fabric-consensus-dev

mohan
2016-08-19 14:06
has joined #fabric-consensus-dev

bcbrock
2016-08-19 14:22
With all of the recent shuffling of code and issues, I may have lost track of the new architecture discussion. I see that the “Next Consensus Architecture Proposal” on the old github Wiki has not been updated for about a month (same for the associated issue). Has this conversation been moved elsewhere? I gather that people are working on the new architecture - where can I find the specification that is being implemented?

simon
2016-08-19 14:22
no spec anymore

tuand
2016-08-19 14:22
@binhn: ^^^

bcbrock
2016-08-19 14:24
@simon - What do you mean by “anymore”?

binhn
2016-08-19 14:25
@bcbrock it is on the fabric mailing list archive, but i will start a jira entry next week to deposit the new material

bcbrock
2016-08-19 14:26
@binhn - Thank you

simon
2016-08-19 15:44
@kostas: you around?

kostas
2016-08-19 15:44
@simon: Yes, was actually just messaging you on view-changes in the other Slack with some notes.

obernin
2016-08-19 17:06
has joined #fabric-consensus-dev

nick
2016-08-21 05:36
hi. I am seeing the following error when I try to setup fabric dev environment in Ubuntu.

nick
2016-08-21 05:36
```
==> default: Setting up openjdk-8-jdk:amd64 (8u91-b14-0ubuntu4~14.04) ...
==> default: update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/bin/appletviewer to provide /usr/bin/appletviewer (appletviewer) in auto mode
==> default: update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/bin/jconsole to provide /usr/bin/jconsole (jconsole) in auto mode
==> default: Processing triggers for libc-bin (2.19-0ubuntu6.8) ...
==> default: update-alternatives: error: no alternatives for mozilla-javaplugin.so
==> default: update-java-alternatives: plugin alternative does not exist: /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/IcedTeaPlugin.so
==> default: docker rmi -f hyperledger/fabric-baseimage
==> default: Error response from daemon: No such image: hyperledger/fabric-baseimage:latest
==> default: make: [base-image-clean] Error 1 (ignored)
==> default: docker rmi -f hyperledger/fabric-src
==> default: Error response from daemon: No such image: hyperledger/fabric-src:latest
==> default: make: [src-image-clean] Error 1 (ignored)
==> default: docker rmi -f hyperledger/fabric-ccenv
==> default: Error response from daemon: No such image: hyperledger/fabric-ccenv:latest
==> default: make: [ccenv-image-clean] Error 1 (ignored)
==> default: docker rmi -f hyperledger/fabric-peer
==> default: Error response from daemon: No such image: hyperledger/fabric-peer:latest
==> default: make: [peer-image-clean] Error 1 (ignored)
==> default: docker rmi -f hyperledger/fabric-membersrvc
==> default: Error response from daemon: No such image: hyperledger/fabric-membersrvc:latest
==> default: make: [membersrvc-image-clean] Error 1 (ignored)
==> default: cd sdk/node && make clean
==> default: make[1]: Entering directory `/opt/gopath/src/github.com/hyperledger/fabric/sdk/node'
==> default: make[1]: Nothing to be done for `clean'.
```

nick
2016-08-21 05:37
I have been using fabric dev env in Windows and I am now trying to install in Ubuntu 14.04

nick
2016-08-21 05:37
has anyone seen this before?

tuand
2016-08-21 12:49
@nick: can you repost in # ?

nick
2016-08-21 14:37
hi tuand. sure . thanks..

jlamiel
2016-08-22 08:30
has joined #fabric-consensus-dev

cbf
2016-08-22 12:31
@simon: pls see my response to your ? https://gerrit.hyperledger.org/r/#/c/583/

simon
2016-08-22 12:51
@cbf: yea, i had opened an issue on github a long time ago. december or so

simon
2016-08-22 12:51
about error handling

simon
2016-08-22 12:51
i don't think that non-transient errors should be returned

simon
2016-08-22 12:52
the system should log + panic

cbf
2016-08-22 12:52
ok, fair enough can you point to the GH issue?

simon
2016-08-22 12:52
sure

cbf
2016-08-22 12:52
I can address

cbf
2016-08-22 12:52
thanks

simon
2016-08-22 12:52
i mean, this is not limited to consensus

simon
2016-08-22 12:52
but everywhere

simon
2016-08-22 12:53
like the whole `ledger, err := ledger.GetLedger()` dance

cbf
2016-08-22 12:53
should I remove returned error then to align signature with others in the interface?

simon
2016-08-22 12:53
i think your change is good

cbf
2016-08-22 12:53
but you’d like it to Panic, yes?

simon
2016-08-22 12:53
but there is a more fundamental issue with permanent errors

simon
2016-08-22 12:54
well, the database should, if it can't write

simon
2016-08-22 12:54
or turn itself read only


cbf
2016-08-22 12:57
I’ll create a Jira for this one

simon
2016-08-22 12:59
thanks


lhaskins
2016-08-22 20:41
I realize that you guys are pretty focused on 1.0, but I'm looking into an issue that I'm seeing from 0.5 on the zACI env. After so many consensus runs, I get locked out of a peer (or 2 or 3). I have finally been able to go through the logs to try to see what's going on and I'm seeing this: ```Received duplicate connection from <nil>, switching to new connection``` followed by numerous instances of:
```
grpc: ClientConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 192.168.122.43:30303: getsockopt: connection refused"; Reconnecting to "824d0533-5f1f-4481-b079-01ffe2fa06fc_vp1-discovery.zone.blockchain.ibm.com:30303"
2016/08/22 22:11:52 transport: http2Client.notifyError got notified that the client transport was broken EOF.
22:11:52.882 [peer] handleChat -> ERRO 029 Error during Chat, stopping handler: rpc error: code = 13 desc = "transport is closing"
22:11:52.882 [peer] chatWithPeer -> ERRO 02a Ending Chat with peer address 824d0533-5f1f-4481-b079-01ffe2fa06fc_vp2-discovery.zone.blockchain.ibm.com:30303 due to error: Error during Chat, stopping handler: rpc error: code = 13 desc = "transport is closing"
```
Any ideas?

tuand
2016-08-22 20:47
the `duplicate connection from <nil>` is more of an informational thing . I think it resolves itself pretty quickly ... the other errors look like tcp or network issues. What do the z admin folks say ? Anything in the system z logs pointing to network problems ?

lhaskins
2016-08-22 20:51
I'll share this additional info with them. They initially thought the problem was due to ssh, but I've eliminated the use of ssh to run on their network and am still seeing the issue. I thought I'd run it by you all just in case something jumped out at you since there are some errors in peer log file about chatting between peers.

simon
2016-08-23 09:28
lhaskins: this is jeff's area - consensus doesn't create connections

muralisr
2016-08-23 10:57
@lhaskins: _"transport: dial tcp 192.168.122.43:30303: getsockopt: connection refused”_ sounds like you could be running out of resources ? we need to understand the end to end scenario (creating a lot of connections ? is the client on the same box as the peer ?)

simon
2016-08-23 11:02
connection refused = port not open

simon
2016-08-23 11:02
meaning, the other side isn't running a fabric peer

lhaskins
2016-08-23 13:55
@muralisr: I looked at that a bit as well. Nishi opened a similar bug a few weeks ago, but the open file limit is set to 64k as recommended. While I can see that the peer VM is up and running, I'm not able to connect to it in any way, nor can any of the other peers. I'm going to break it down and hone in on it... Thanks!

simon
2016-08-23 13:57
lhaskins is that possibly a VM that you send a lot of REST requests to?

lhaskins
2016-08-23 13:58
yes, it is

lhaskins
2016-08-23 13:59
do you have any ideas where I should focus my attention?

muralisr
2016-08-23 14:00
@lhaskins: the open file limit was the reason for my original question …. but looks like you have addressed that

simon
2016-08-23 14:00
last time i saw that was that the process ran out of open files, because there were many lingering REST calls

simon
2016-08-23 14:01
i hacked that in my tree by 1. setting the request http to Connection: close, and 2. setting a timeout on the rest http service

lhaskins
2016-08-23 14:03
I'll be sure to explicitly close the connections, but I did verify that the open file limit was increased to 64K.. Thanks for the pointer

mohan
2016-08-23 14:41
Hi, I was trying to understand consensus, and I came across this new architecture https://github.com/hyperledger/fabric/wiki/Next-Consensus-Architecture-Proposal . Is this architecture already part of current fabric code?

tuand
2016-08-23 14:44
@mohan: this architecture is not implemented yet. Look at hyperledger/fabric/docs/protocol-specs.md for more info on current code

mohan
2016-08-23 15:35
@tuand Thanks, I will check upon the protocol-specs.md.

simon
2016-08-24 11:38
i just had an epiphany

simon
2016-08-24 11:39
with only one outstanding batch, i don't think we need more than an "implicit watermark" of 1

jyellick
2016-08-24 12:36
I don't know that I agree

jyellick
2016-08-24 12:37
Because although you have ordering within a single stream, you don't have it across streams, it's perfectly possible that you could receive prepares for 10 seqNos from a backup before ever receiving the first pre-prepare from the primary

simon
2016-08-24 12:50
hmm

simon
2016-08-24 12:51
so you're telling me that you can only do any sort of BFT with watermarks?

simon
2016-08-24 12:51
i agree that we should buffer messages that refer to the future

simon
2016-08-24 12:52
but that's at each replica's discretion

jyellick
2016-08-24 12:56
If you are going to buffer things in a bounded way, it seems like you have no choice but to implement a sort of watermarks

jyellick
2016-08-24 12:57
And, for garbage collection purposes, unless you want to change the view change substantially (or set K=1) then they also seem useful

simon
2016-08-24 12:57
i want that a new request will only be pre-prepared when the previous request committed

simon
2016-08-24 12:57
did you have a chance to look at my new code?

jyellick
2016-08-24 12:58
I have not

simon
2016-08-24 12:58
i'm trying to make it easier to read


simon
2016-08-24 12:58
i'd appreciate feedback

jyellick
2016-08-24 12:59
Okay, not positive if I'll get a chance to review today, but will try to tomorrow

simon
2016-08-24 12:59
and one property is that there is just one request in flight at a time

simon
2016-08-24 12:59
okay

simon
2016-08-24 12:59
i'll be out from tomorrow including tuesday

jyellick
2016-08-24 13:00
Oh okay, enjoy your vacation

simon
2016-08-24 13:00
moving stuff from berlin :confused:

simon
2016-08-24 13:00
no vacation

jyellick
2016-08-24 13:00
Oh, that is noticeably less fun

simon
2016-08-24 13:01
dealing with a guy who used my bed and wants rent from me for taking up space...

jyellick
2016-08-24 13:02
Oh, that's unfortunate, good luck

simon
2016-08-24 13:03
thanks

simon
2016-08-24 13:03
so, if the primary is only allowed to pre-prepare after a request commits

simon
2016-08-24 13:04
that means that a quorum of correct replicas must have committed the previous request

simon
2016-08-24 13:04
and they won't send a prepare for the next unless the previous request committed

simon
2016-08-24 13:05
now if i'm correct but lagged asymmetrically, i might receive some sequence numbers before the commits of the previous request

jyellick
2016-08-24 13:06
Right

simon
2016-08-24 13:06
i'm trying to figure out, is it sufficient to just talk about a maximum of 2 requests in a new-view message?

simon
2016-08-24 13:06
the most recently committed request, and the most recently prepared one?

simon
2016-08-24 13:08
the terminology always confuses me

simon
2016-08-24 13:08
the most recent one for which a quorum received prepare messages and sent commit messages

simon
2016-08-24 13:09
and the most recent one for which a quorum received pre-prepare messages and sent prepare messages

simon
2016-08-24 13:10
i wish there was a simple bft protocol without this complicated watermark business

jyellick
2016-08-24 13:12
You could set K=1

jyellick
2016-08-24 13:13
Then you would never have to talk about prepared and committed stuff

jyellick
2016-08-24 13:13
Just pick the initial checkpoint, and work from there

simon
2016-08-24 13:14
yea, that's what i thought

jyellick
2016-08-24 13:14
Maybe you would have to deal with one prepared

simon
2016-08-24 13:14
but it isn't sufficient

simon
2016-08-24 13:14
i think

simon
2016-08-24 13:15
because in pbft, there can be multiple outstanding checkpoints

simon
2016-08-24 13:15
so L=1 too?

jyellick
2016-08-24 13:16
I think you could potentially come up with a 'correct' protocol that way, but I think it's only going to exacerbate the problem of only 2f+1 correct nodes ever being in sync

simon
2016-08-24 13:16
the question is: do i implement all the watermark stuff etc

simon
2016-08-24 13:16
or is there a simpler way

simon
2016-08-24 13:17
if we require synchrony for every request

jyellick
2016-08-24 13:17
I think the complicated piece is view change, otherwise the watermarks are pretty simple?

simon
2016-08-24 13:17
yes

jyellick
2016-08-24 13:17
So, I'd be inclined to say "Keep watermarks, but set K=1"

simon
2016-08-24 13:17
i looked at viewchange.go

simon
2016-08-24 13:17
and boy it is complicated and confusing

jyellick
2016-08-24 13:17
So your view change should be very simple, and the watermarks for checkpoint garbage collection aren't that bad

jyellick
2016-08-24 13:18
Yes, viewchange.go is a dense piece of code

simon
2016-08-24 13:19
i wanted to hardcode K=1, but marko wants to be able to checkpoint less often

simon
2016-08-24 13:19
but i think independent of that, in pbft, there can be multiple in flight checkpoints

simon
2016-08-24 13:19
even with K=1

jyellick
2016-08-24 13:19
Definitely there can be

jyellick
2016-08-24 13:20
But, the view change gets very simple, because you only operate on sequence numbers in the impending checkpoint window, which means you only have 1 seqNo to operate on

simon
2016-08-24 13:20
so the new-view becomes complicated, because suddenly there are multiple requests being "pre-prepared" at once

simon
2016-08-24 13:20
no, i don't think so

simon
2016-08-24 13:20
i think you're talking about L=1?

jyellick
2016-08-24 13:20
No

jyellick
2016-08-24 13:21
The only way a checkpoint stable cert happens is if 2f+1 have committed for that sequence number

simon
2016-08-24 13:21
yes

jyellick
2016-08-24 13:21
And the only way you prepare, is if you've committed the previous seqNo
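
The sequencing rule here — only prepare seqNo n once n-1 has committed — amounts to a one-slot pipeline. A minimal sketch with invented names:

```go
package main

import "fmt"

// replica is an illustrative model (not the fabric code) of the rule that a
// PREPARE for seqNo n may only be sent after n-1 has committed, so at most
// one request past the last commit is ever in flight.
type replica struct {
	lastCommitted uint64
}

// mayPrepare reports whether this replica may send a PREPARE for seqNo.
func (r *replica) mayPrepare(seqNo uint64) bool {
	return seqNo == r.lastCommitted+1
}

// commit records a commit, rejecting anything out of sequence.
func (r *replica) commit(seqNo uint64) error {
	if seqNo != r.lastCommitted+1 {
		return fmt.Errorf("out-of-order commit %d, last committed %d", seqNo, r.lastCommitted)
	}
	r.lastCommitted = seqNo
	return nil
}

func main() {
	r := &replica{}
	fmt.Println(r.mayPrepare(1)) // nothing committed yet, so seq 1 may prepare
	_ = r.commit(1)
	fmt.Println(r.mayPrepare(2), r.mayPrepare(3)) // 2 yes, 3 not until 2 commits
}
```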

simon
2016-08-24 13:22
that makes it L=1?

jyellick
2016-08-24 13:22
Oh, I was just following the rule you'd suggested earlier

simon
2016-08-24 13:22
ok

simon
2016-08-24 13:22
go on

jyellick
2016-08-24 13:23
So, assume that a request has prepared

jyellick
2016-08-24 13:23
Then there are 2f+1 checkpoints for the previous sequence number

simon
2016-08-24 13:23
ok, so i will only send a prepare if the previous seq not only committed, but also reached a stable checkpoint?

jyellick
2016-08-24 13:25
I'd have to work through the corner cases, but I don't think that's necessary. On view change, everyone sends their own checkpoint store, and if there are f+1 checkpoints for a given sequence number, that's (mostly) sufficient to pick it

jyellick
2016-08-24 13:25
So my proposition is, that on view change, we would always end up picking the highest valid checkpoint which existed, which would ensure that there was only ever 1 prepared request with higher sequence number
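
The checkpoint-selection rule proposed here — pick the highest checkpoint attested by at least f+1 view-change messages — might look like the following. This is a hedged sketch, not the fabric implementation:

```go
package main

import "fmt"

// pickCheckpoint returns the highest seqNo for which at least f+1
// view-change messages carried a matching checkpoint; f+1 guarantees at
// least one correct replica vouches for it. Illustrative sketch only.
func pickCheckpoint(attested map[uint64]int, f int) (uint64, bool) {
	var best uint64
	found := false
	for seqNo, count := range attested {
		if count >= f+1 && (!found || seqNo > best) {
			best, found = seqNo, true
		}
	}
	return best, found
}

func main() {
	// f=1: seqNo 5 has a single attestation (could be a lie), seqNo 4 has two.
	seqNo, ok := pickCheckpoint(map[uint64]int{4: 2, 5: 1}, 1)
	fmt.Println(seqNo, ok)
}
```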

simon
2016-08-24 13:26
ah, and the reason is that i send a checkpoint after a request commits

simon
2016-08-24 13:27
and i also send a prepare only after a request commits

simon
2016-08-24 13:28
it implies that (short of message loss?) a checkpoint message must have been sent (and received) before the prepare?

simon
2016-08-24 13:28
unless we do something asynchronous that might send the checkpoint message any time later

jyellick
2016-08-24 13:29
Sent yes, received I don't think is necessary

jyellick
2016-08-24 13:29
Because on view change, the checkpoint store is sent

simon
2016-08-24 13:29
i'm wondering whether it is possible that there are multiple checkpoints outstanding

jyellick
2016-08-24 13:30
So, if 2f+1 have sent the checkpoint, it's in their store, and so in order to achieve 2f+1 view change messages, at least f+1 of them must include that checkpoint
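
The arithmetic behind this is plain quorum intersection: any two sets of 2f+1 replicas out of 3f+1 overlap in at least f+1 members. A small Go illustration of the pigeonhole bound:

```go
package main

import "fmt"

// minOverlap returns the smallest possible intersection of two quorums of
// size q drawn from n replicas (pigeonhole: 2q - n, floored at zero).
func minOverlap(q, n int) int {
	if o := 2*q - n; o > 0 {
		return o
	}
	return 0
}

func main() {
	f := 1
	n, q := 3*f+1, 2*f+1
	// Any 2f+1 view-change messages intersect the 2f+1 checkpoint senders
	// in at least f+1 replicas, so the checkpoint must appear f+1 times.
	fmt.Println(minOverlap(q, n)) // f+1 for n=3f+1
}
```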

simon
2016-08-24 13:30
2f+1 correct?

simon
2016-08-24 13:30
or any 2f+1

simon
2016-08-24 13:31
hmmm

simon
2016-08-24 13:31
no, i don't get it

simon
2016-08-24 13:32
or do i?

simon
2016-08-24 13:33
ah, i include the latest checkpoint (for my latest executed request) in the view-change message

simon
2016-08-24 13:34
and short of stragglers, 2f+1 must have either committed seq N or seq N+1

jyellick
2016-08-24 13:43
Now I'm second guessing the original view change protocol. In a network of 3f+1 with K=1, say f non-byzantine nodes are slow because that happens, and have their last seqNo=4 and no prepare/preprepare, the rest of the network of 2f+1, including f byzantine just wrote a checkpoint for seqNo=5. Suddenly a view change starts, but those f who are at seqNo=4 send their checkpoint store and no p-set or q-set, and the f byzantine see this, so pile on and copy them, and only 1 non-behind non-byzantine node sends in a view change message with seqNo=5 and that checkpoint.

jyellick
2016-08-24 13:46
So the view change contains 2f matching attestations that the last checkpoint was at seqNo=4, and one claiming it was at seqNo=5. I guess it's fine, the network can settle on the wrong checkpoint, because nothing will preprepare/prepare because f+1 guys are at seqNo=5

jyellick
2016-08-24 13:47
And eventually it will view change again, and hopefully eventually get a set of f+1 correct checkpoints? Really need to go reread the view change section of the paper I suppose

simon
2016-08-24 13:47
i think that's where L comes in

simon
2016-08-24 13:48
seqno=5 does not have a stable checkpoint in the view-change messages

simon
2016-08-24 13:48
so seqno=4 it is

simon
2016-08-24 13:49
so new primary then assigns seqno=5 to that request, and everybody does the prepare/commit cycle for 5

simon
2016-08-24 13:49
no?

jyellick
2016-08-24 13:49
But f+1 non-byzantine guys have already committed a request at seqNo=5

simon
2016-08-24 13:49
yes

jyellick
2016-08-24 13:50
So I would say the new primary assigns seqNo=5, and it cannot possibly prepare/commit

simon
2016-08-24 13:50
the same request will be prepared

jyellick
2016-08-24 13:50
Those f+1 won't send a prepare

simon
2016-08-24 13:50
they will

simon
2016-08-24 13:50
independent of whether they already executed it or not

jyellick
2016-08-24 13:50
How do they know it's the same request they already prepared?

jyellick
2016-08-24 13:51
They've already garbage collected that, because they hit a stable checkpoint

simon
2016-08-24 13:51
hum

jyellick
2016-08-24 13:51
seqNo=5 is under their watermark

simon
2016-08-24 13:51
yes i see

simon
2016-08-24 13:51
that view change protocol always was sketchy to me

simon
2016-08-24 13:52
or i just don't get it

jyellick
2016-08-24 13:53
I strongly suspect this is handled, but it's so complicated I have to reread and re convince myself every so often, and I guess it's that time

simon
2016-08-24 13:53
yes

simon
2016-08-24 13:53
and that makes it essentially impossible to implement without errors

jyellick
2016-08-24 13:54
We also implemented the unbounded memory growth version, which always troubled me

simon
2016-08-24 13:54
yea

simon
2016-08-24 13:54
so let's talk about simplified

simon
2016-08-24 13:55
conceptually, the view change should be the same as all nodes crash and come back up

simon
2016-08-24 13:55
right?

jyellick
2016-08-24 13:56
Sure

simon
2016-08-24 13:57
so if everything is sequential, we have (a) the last request we executed

simon
2016-08-24 13:58
and possibly (b) a subsequent request that we sent a commit for (because we received a quorum of prepares)

simon
2016-08-24 13:58
and then there is (c) we sent a prepare for because we received a pre-prepare

simon
2016-08-24 14:00
either there are 2f+1 for (a) and seqno=X

simon
2016-08-24 14:00
or there are 2f+1 for (a) and seqno=X and seqno=X+1

simon
2016-08-24 14:00
right?
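
The (a)/(b)/(c) split simon describes could be modeled roughly as follows; the type and field names are hypothetical, not the simplebft types:

```go
package main

import "fmt"

// entry identifies a request by its sequence number, view, and digest.
type entry struct {
	SeqNo  uint64
	View   uint64
	Digest string
}

// viewChange is a hypothetical shape for the simplified view-change message:
// (a) the last executed request, (b) an optional request we sent a COMMIT
// for (having received a quorum of prepares), and (c) an optional request
// we sent a PREPARE for (having received a pre-prepare).
type viewChange struct {
	NewView     uint64
	Executed    uint64 // (a) seqNo of the last executed request
	CommitSent  *entry // (b) nil if no commit is pending
	PrepareSent *entry // (c) nil if no prepare is pending
}

func main() {
	vc := viewChange{NewView: 4, Executed: 10,
		CommitSent: &entry{SeqNo: 11, View: 3, Digest: "abc"}}
	fmt.Println(vc.Executed, vc.CommitSent.SeqNo)
}
```

Because everything is sequential, (b) and (c) can each hold at most one entry, which is what keeps this message so much smaller than the full PBFT P and Q sets.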

jyellick
2016-08-24 14:01
So (a) is equivalent to "we have a quorum of commits"

simon
2016-08-24 14:01
yes

simon
2016-08-24 14:01
well no

simon
2016-08-24 14:01
even more

simon
2016-08-24 14:01
a quorum of checkpoints

simon
2016-08-24 14:02
it implies a quorum of commits, and a quorum of prepares

jyellick
2016-08-24 14:02
Ah, okay. Moving on then: > either there are 2f+1 for (a) and seqno=X There are 2f+1 what's? What is X? > or there are 2f+1 for (a) and seqno=X and seqno=X+1 Again, 2f+1 whats? and how can seqNo=X and X+1

simon
2016-08-24 14:03
2f+1 nodes claim that they have executed seqno=X

simon
2016-08-24 14:03
some might claim they executed seqno=X+1

jyellick
2016-08-24 14:04
Well, you have to subtract f, no? Because f of your 2f+1 may be byzantine

simon
2016-08-24 14:04
yes

simon
2016-08-24 14:04
hm

simon
2016-08-24 14:05
but that's fine

simon
2016-08-24 14:05
because that still leaves at least f+1 correct replicas to have executed seqno=X

jyellick
2016-08-24 14:06
Yes

simon
2016-08-24 14:06
so some of these 2f+1 may talk about having executed X+1

simon
2016-08-24 14:07
and that can include correct replicas

simon
2016-08-24 14:07
but for that to be true, 2f+1 must talk about having sent a commit for X+1

simon
2016-08-24 14:07
right?

simon
2016-08-24 14:08
so, that's (b)

jyellick
2016-08-24 14:09
But the part that seems really tricky to me

jyellick
2016-08-24 14:09
Is that traditionally you can only wait for 2f+1 replies

simon
2016-08-24 14:09
not during new-view

jyellick
2016-08-24 14:09
So, if f are byzantine, and f are behind

simon
2016-08-24 14:10
the paper says that you may have to wait for more than n-f view-change messages

simon
2016-08-24 14:10
hm i see

simon
2016-08-24 14:10
yes, so it is f+1 that are actually sufficient?

simon
2016-08-24 14:12
and that dense fig. 4 in the pbft paper doesn't help either

jyellick
2016-08-24 14:22
I don't know that f+1 is actually sufficient, for risk of getting to a valid but old point

jyellick
2016-08-24 14:22
This is why you need 2f+1 agreement on some things, low watermarks maybe?

simon
2016-08-24 14:22
so i think first, f+1 correct replicas have to be at state X or state X+1

simon
2016-08-24 14:22
right?

jyellick
2016-08-24 14:23
I think 2f+1 need to agree that the lowest state is X, which should be able to happen

simon
2016-08-24 14:26
so we send current state and previous state in the view-change message?

simon
2016-08-24 14:26
that seems reasonable

simon
2016-08-24 14:26
and aligns well with the block chain thing

jyellick
2016-08-24 14:28
So, f+1 attesting to a state at or above what the highest state 2f+1 are aware of

jyellick
2016-08-24 14:28
I think works

simon
2016-08-24 14:30
i don't understand that sentence

simon
2016-08-24 14:31
if there are more than f at a state X, then X is the new starting point

jyellick
2016-08-24 14:33
I don't think that's valid

jyellick
2016-08-24 14:34
If there are more than f at state X, then X is a valid state, but may be old

jyellick
2016-08-24 14:34
I think first, you need 2f+1 to agree that state X is not an outdated state

simon
2016-08-24 14:34
by being in that state or in the state after it?

jyellick
2016-08-24 14:34
By being in that state or in a state before it

simon
2016-08-24 14:34
ok, same thing

simon
2016-08-24 14:34
:slightly_smiling_face:

simon
2016-08-24 14:35
a state?

simon
2016-08-24 14:35
or the state before it?

jyellick
2016-08-24 14:36
So if f+1 assert that seqNo=10, with checkpoint of foo, then we need 2f+1 (including those f+1, so f additional) to assert that they are not past seqNo=10

jyellick
2016-08-24 14:36
f+1 gives you correctness, and 2f+1 gives you that there are not f+1 with a newer correct state
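
This selection rule — at least f+1 replicas attest to exactly X, and at least 2f+1 attest to a state at or below X — can be sketched as follows (illustrative only; `pickStart` is an invented name):

```go
package main

import "fmt"

// pickStart returns the largest seqNo X such that at least f+1 replicas
// report exactly X (so some correct replica really reached it) and at
// least 2f+1 report a state at or below X (so X is not outdated, i.e.
// there cannot be f+1 correct replicas strictly ahead of it).
func pickStart(states []uint64, f int) (uint64, bool) {
	var best uint64
	found := false
	for _, x := range states {
		at, atOrBelow := 0, 0
		for _, s := range states {
			if s == x {
				at++
			}
			if s <= x {
				atOrBelow++
			}
		}
		if at >= f+1 && atOrBelow >= 2*f+1 && (!found || x > best) {
			best, found = x, true
		}
	}
	return best, found
}

func main() {
	// f=1, n=4: three replicas at seqNo 10, one claiming 11.
	x, ok := pickStart([]uint64{10, 10, 10, 11}, 1)
	fmt.Println(x, ok)
}
```

In the example, 11 has only one attester (below f+1), while 10 has three attesters and three replicas at-or-below it, so 10 is chosen as the starting point.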

simon
2016-08-24 14:37
so what if some assert seqno=11

simon
2016-08-24 14:37
correct replicas

jyellick
2016-08-24 14:39
Well, if there are f+1 correct replicas asserting seqno=11, then we will not get 2f+1 asserting that 10 is a current state, so, we will not produce a valid view starting from 10

jyellick
2016-08-24 14:39
Instead, the view change messages will continue to collect until those f+1 correct replicas vouch for seqNo=11, and that will be chosen as the starting point

simon
2016-08-24 14:40
what if <f+1 replicas are at seqno=11?

simon
2016-08-24 14:40
correct ones

jyellick
2016-08-24 14:40
That's not possible?

simon
2016-08-24 14:41
why?

simon
2016-08-24 14:42
couldn't it be that seqno=11 prepared, but then some network outage prevented commits from reaching everybody?

jyellick
2016-08-24 14:42
Oh, so yes, that is possible

jyellick
2016-08-24 14:42
Right

simon
2016-08-24 14:42
wow, the brain gymnastics

jyellick
2016-08-24 14:42
Which is why you still have to contend with that single-prepared possibility for that sequence number

simon
2016-08-24 14:43
yes

jyellick
2016-08-24 14:45
However, in that case, since it did not commit everywhere, you would have f+1 asserting seqNo=10, and only f asserting that 11 was the current state, so you could successfully view change to seqNo=10, but, in order for anyone to have committed, it must have prepared at at least f+1 correct replicas, so it should be included in the view change. This gets back to my original question though, of if f are byzantine and lie about that prepare, and f are not included in the view change message, that only leaves one person attesting to the prepare, which I didn't think was enough to include it

jyellick
2016-08-24 14:46
But, I suppose since f+1 valid replicas did prepare it, that nothing else should prepare for that sequence number

simon
2016-08-24 14:47
why f+1 valid?

simon
2016-08-24 14:47
ah yes

simon
2016-08-24 14:47
but if f assert that 11 was current state, that doesn't say anything

simon
2016-08-24 14:48
but >=f+1 assert that 11 prepared (i.e. they sent commit messages)

simon
2016-08-24 14:48
actually 2f+1

simon
2016-08-24 14:48
well, f+1 is sufficient

simon
2016-08-24 14:49
to pre-prepare 11 again

simon
2016-08-24 14:53
my feeling is that a replica that restarts should do the same

simon
2016-08-24 14:53
and send a view-change

simon
2016-08-24 14:53
or very similar

simon
2016-08-24 14:53
to do state transfer

jyellick
2016-08-24 14:54
Yes, I think restarting should basically initiate a view change

simon
2016-08-24 14:58
or a state transfer

jyellick
2016-08-24 15:00
Well, by view-change I mean to have the network sync with the new replica on a starting point

jyellick
2016-08-24 15:00
In an ideal world, the view change would not even need to change leaders

jyellick
2016-08-24 15:01
Just have everyone on the network compute and send a view change message to the new replica, and it should be able to compute a starting point

simon
2016-08-24 15:01
yes

jyellick
2016-08-24 15:01
More of a 'current-view' than a 'view-change'

simon
2016-08-24 15:01
although they might refer to different commits

simon
2016-08-24 15:01
but then there would be a commit certificate for that replica to pick up on

jyellick
2016-08-24 15:02
Yes, I suppose the reason why view-change works, is because the network halts during it

simon
2016-08-24 15:07
okay, i'll have to let that sink in

simon
2016-08-24 15:07
would be great if you could give some comments on the code i wrote

jyellick
2016-08-24 15:07
Yes, I'll try to get to that tomorrow, or at the very least before Tuesday

jyellick
2016-08-24 15:08
(Since I assume you'll not be doing work until your return then?)

simon
2016-08-24 15:10
yep

nikileshsa
2016-08-24 21:55
has joined #fabric-consensus-dev

pushpalatha
2016-08-25 06:59
has joined #fabric-consensus-dev

2016-08-29 04:04
@poly commented on @binhn’s file https://hyperledgerproject.slack.com/files/binhn/F1VDNHG4E/fabricnext-community.pptx: Hi binhn, how is this going, is there any newer plan? I'm concerned more on membership HA and 3rd party integration. Thanks!

matanyahu
2016-08-29 19:51
has joined #fabric-consensus-dev

matanyahu
2016-08-29 19:53
Hi - I was curious what recommended deployment topology you would implement for a scenario of a federated blockchain where two parties are on equal terms with each other. Would you deploy 2 * 4 validating peers and two certificate authorities on two physical sites?

mohan
2016-08-29 19:54
Hi, I am currently trying to understand how consensus works underneath. I created a 4-node peer network and a membersrvc, and set the consensus to pbft in batch mode. I deployed the chaincodeexample002, ran a few queries, and verified that after every transaction a new block is created. I also stopped two peers and found that even when a transaction is done, new blocks are not added; once there are 3 peers again, those pending transactions are written to the ledger. I want to understand more of the low-level details of consensus: how are these transaction blocks appended/created? How does the message exchange between the peer nodes happen? How are the groups of transactions (blocks) exchanged to arrive at consensus? Is there a way for me to learn about this, enable logs, or pause in between to view the transactions, blocks, or messages?

matanyahu
2016-08-29 19:54
Otherwise, I still do not get how to avoid a SPOF in the case of certificate authorities signing eCerts and tCerts into the blockchain, as in a scenario where the CA infrastructure is on one physical site which is suddenly disconnected from another site. That would implicitly stop the second site from being able to create new transactions or add new users to the network

matanyahu
2016-08-29 19:55
@mohan : isn't it the premise of PBFT, that f=(N-1)/3 always has to equal <1 ?

mohan
2016-08-29 19:57
@matanyahu Yes, you are right. In the case of a 4-peer network there should be at least 3 online peers. I am trying to understand the low-level workings. Do you have any idea how the message exchange happens and consensus is reached?

matanyahu
2016-08-29 20:00
@mohan : nothing more than what you can read on Bluemix documentation :slightly_smiling_face: https://console.ng.bluemix.net/docs/services/blockchain/etn_pbft.html

tuand
2016-08-29 20:01
PBFT protocol is described in the paper by Castro and Liskov: Practical Byzantine fault tolerance and proactive recovery

matanyahu
2016-08-29 20:02
@tuand : I have it on my reading list :slightly_smiling_face:

tuand
2016-08-29 20:02
in a nutshell, a network of N nodes can function even if there are f failing ( or byzantine) nodes , where f = (N-1)/3

tuand
2016-08-29 20:03
so in this case N=4, f =1 , network still works when 1 node is out. but if 2 nodes are out, we cannot get to consensus

matanyahu
2016-08-29 20:03
I was curious what would happen if we have a federation/consortium network deployed in two sites, each having 4 approval peers, and suddenly a connection between the two would be cut. How does PBFT behave in split brain scenario and afterwards, when a reconciliation has to happen?

tuand
2016-08-29 20:04
the fabric pbft implementation pretty much follows the protocol described in Castro and Liskov

tuand
2016-08-29 20:07
I don't think you can have a split like you describe ... you have to think of it as an 8-peer network ... and yes, membersrvc is a single point of failure. There's work on doing that in an HA way

tuand
2016-08-29 20:07
maybe @jonathanlevi can chime in on membersrvc

matanyahu
2016-08-29 20:08
@tuand : if i think of it as an 8 peer network then I basically get a temporary fork into two world states

tuand
2016-08-29 20:10
you don't have consensus unless you have 2f+1 commit messages

matanyahu
2016-08-29 20:11
you mean, if there were initially 8 peers, then it is assumed that in case of lost connectivity the whole network is going to cease to function, because all 8 validating peers know that there are 8 of them in the network?

tuand
2016-08-29 20:11
pbft is a sequence of 3 message phases: 1 pre-prepare, followed by 2f prepares, followed by 2f+1 commits

jyellick
2016-08-29 20:12
@matanyahu An 8 node network can tolerate at most f=2, so, in your case, in an 8 node network, split in two, each half would be experiencing f=4, so the network would halt until connectivity returned

matanyahu
2016-08-29 20:12
ok, got it

matanyahu
2016-08-29 20:12
that is why it is not possible to dynamically add new nodes

tuand
2016-08-29 20:12
yes, N is fixed. we're looking at dynamic addition of nodes but we can use all the help we can get :slightly_smiling_face:

matanyahu
2016-08-29 20:12
got it

matanyahu
2016-08-29 20:14
hopefully this option will be available in GA version of fabric because this is what IBM promises to the clients :wink:

jyellick
2016-08-29 20:15
@matanyahu If you look carefully at the next architecture, you'll see that we already allow dynamic addition of endorsing peers, and of validating peers. We have plans to allow dynamic addition of PBFT ordering replicas eventually, but you might find your use case is already handled by the other options.

matanyahu
2016-08-29 20:16
I will be making a presentation about the hyperledger fabric architecture tomorrow; I would like to know the known limitations of the current release, and what @tuand mentioned is really important.

matanyahu
2016-08-29 20:17
@jyellick : will read it again. I looked at it couple of days ago but I think I had to concentrate too much on peer differentiation topic

jyellick
2016-08-29 20:21
@matanyahu In the 0.5 release, you can think of the concepts of endorsement (execution of chaincode), ordering (pbft), and validation (removal of invalid transactions and updating of the DB) as all stuck together in one package. These three have been broken out into separate concepts. The validation and endorsement pieces scale horizontally with relative ease, as they leverage the log-replication facility of the ordering. Dynamically adding ordering nodes in a BFT way is challenging. Although we intend to handle this, we don't want to do it in a haphazard way: there is considerable academic rigor surrounding the PBFT protocol, and we want to make sure we don't lose any of its proven guarantees (such as liveness).

mohan
2016-08-29 20:22
@jyellick So when the network has more than (N-1)/3 failing nodes, the network would halt. What would happen to transactions that are being carried out on the peers which are online? Will those be added back once the network is online?

jyellick
2016-08-29 20:25
So, in cases of extreme failure, it cannot be guaranteed that a transaction is not lost

jyellick
2016-08-29 20:26
However, assuming those nodes are online and healthy, then yes, once the network is re-established, pending transactions should process normally

jyellick
2016-08-29 20:28
(As an example of extreme failure, a transaction comes into a PBFT node, who tries to broadcast it to the network, but is experiencing a network failure, and then crashes)

jyellick
2016-08-29 20:30
And, as a nitpick, usually f=floor((N-1)/3), but nothing prevents someone from running a network of N=10 nodes, with f=1, although this would be an unusual configuration.

jyellick
2016-08-29 20:49
[And note, I would not recommend running a network of N=10, f=1, as this would allow for the network to bifurcate in the situation you described. Since each half of the bifurcated network has 2f+1 participating nodes, they can proceed. When f=floor((n-1)/3) we have that 2f+1 = 2*floor((n-1)/3) +1 > 2*((n-1)/3-1/2)+1 = 2/3*n - 2/3 - 1 + 1 = 2/3*n - 2/3 = 1/2*n + 1/6*n - 2/3 >= 1/2 * n for n >= 4. So, by setting f=floor((n-1)/3) your network is protected from network partitions causing forks, because a partition must have more than half the network present.]
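
The partition hazard in the bracketed note can be checked numerically. A sketch, with f computed by the usual floor((n-1)/3) rule (the function names are illustrative):

```go
package main

import "fmt"

// maxFaults is the standard PBFT fault tolerance for n replicas.
func maxFaults(n int) int { return (n - 1) / 3 }

// forkPossible reports whether an even network partition could leave a
// 2f+1 quorum on both sides, allowing the two halves to fork — the
// n=10, f=1 hazard described above.
func forkPossible(n, f int) bool {
	quorum := 2*f + 1
	return n/2 >= quorum && (n-n/2) >= quorum
}

func main() {
	fmt.Println(forkPossible(10, 1))            // both halves of 5 can reach quorum 3
	fmt.Println(forkPossible(10, maxFaults(10))) // f=3, quorum 7: no half can proceed
}
```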

kostas
2016-08-29 23:25
Alternatively, if you _must_ insist on picking a total number N that's > 3f, you need (N+f)/2 messages rounded up before preparing or committing a request (versus 2f+1, as described in the Castro paper). This is described by @cca in his Yet Another Visit to Paxos paper: https://www.zurich.ibm.com/%7Ecca/papers/pax.pdf
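
The rounded-up (N+f)/2 rule cited here reduces to one line of integer arithmetic; for N = 3f+1 it recovers the familiar 2f+1 (a sketch, `quorum` is an invented name):

```go
package main

import "fmt"

// quorum returns ceil((n+f)/2), the number of matching messages needed
// before preparing or committing when n may exceed 3f+1.
func quorum(n, f int) int { return (n + f + 1) / 2 }

func main() {
	fmt.Println(quorum(4, 1))  // 3, i.e. 2f+1 for n=3f+1
	fmt.Println(quorum(10, 1)) // 6: more than half of 10, so two halves of 5 cannot both reach it
}
```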

stylix
2016-08-29 23:55
has joined #fabric-consensus-dev

jyellick
2016-08-30 13:06
Perhaps @kostas can correct me, but I believe he meant > a total number N that's > 3f *+ 1*

chainsaw
2016-08-30 16:15
has joined #fabric-consensus-dev

kostas
2016-08-30 19:35
@jyellick: Yes, thx for catching this.

hfeeki
2016-08-31 02:14
has joined #fabric-consensus-dev

simon
2016-09-01 09:28
@vukolic: what is the q set used for (if i only have one outstanding request ever)?

simon
2016-09-01 09:29
i understand that the p set records "I sent a commit message", and so if f+1 (or more) replicas claim to have sent a commit message, it is possible that some correct replica executed the request, therefore requiring this request to be executed at its sequence number in the new view.

simon
2016-09-01 09:30
but what function does the q set serve? something about the new primary censoring the request?

vukolic
2016-09-01 09:40
it is about Byzantine replicas lying about the P set when you do not have signatures

vukolic
2016-09-01 09:40
imagine you have the case where you had a committed request

vukolic
2016-09-01 09:40
but one replica reports the correct p value

vukolic
2016-09-01 09:40
and the other does not

vukolic
2016-09-01 09:41
how you will know is by the f+1 Q-set appearances of the value that was actually (potentially) committed

vukolic
2016-09-01 09:41
if one had signatures then signed P set w/o Q set would do it

simon
2016-09-01 09:41
you mean signatures on viewchange?

simon
2016-09-01 09:42
because that's what we have

simon
2016-09-01 09:42
i think a lot of confusion just lifted

vukolic
2016-09-01 09:42
no signatures on PREPARE

vukolic
2016-09-01 09:43
if you had them you would not need Q set

vukolic
2016-09-01 09:43
without them - you need them because of Byzantine replica, just making up the P set as it wishes

vukolic
2016-09-01 09:43
of course we do not test for this - so it is really an algorithmic attack

vukolic
2016-09-01 09:43
that one well-versed in PBFT could pull out

vukolic
2016-09-01 09:44
with a lot of network control

simon
2016-09-01 09:44
but only f byz replicas can make up their P set

vukolic
2016-09-01 09:44
yes, but your view change reacts on 1 P set

vukolic
2016-09-01 09:44
if 1 replica reports a P set you react on it

vukolic
2016-09-01 09:44
if 1 replica reports P1 and other P2

vukolic
2016-09-01 09:44
without Q set you would not know what to do

simon
2016-09-01 09:44
oh!

simon
2016-09-01 09:44
you mean a single byzantine replica sends two different P-sets?

vukolic
2016-09-01 09:45
no

vukolic
2016-09-01 09:45
1 sends P1

vukolic
2016-09-01 09:45
the other P2

simon
2016-09-01 09:45
okay

vukolic
2016-09-01 09:45
others do not send anything

vukolic
2016-09-01 09:45
what do you do?

simon
2016-09-01 09:45
i wait until i have at least N-f P-sets

vukolic
2016-09-01 09:45
let's say they are all empty

vukolic
2016-09-01 09:45
all others

vukolic
2016-09-01 09:46
so you have 3 replicas

vukolic
2016-09-01 09:46
1 reports P1

vukolic
2016-09-01 09:46
2nd P2

vukolic
2016-09-01 09:46
3rd nothing

vukolic
2016-09-01 09:46
what do you do?

vukolic
2016-09-01 09:46
you can wait for the 4th

vukolic
2016-09-01 09:46
4th reports nothing

simon
2016-09-01 09:47
i'd say this is against the assumptions

simon
2016-09-01 09:47
how can 1 report p1, 2 report p2, and 3, 4 report nothing?

vukolic
2016-09-01 09:47
it could but maybe I am not making the right example

simon
2016-09-01 09:48
hmm

simon
2016-09-01 09:48
no you're right

simon
2016-09-01 09:48
it could happen

vukolic
2016-09-01 09:48
it could happen but in this particular case you could tell that the answer was there was no request committed

simon
2016-09-01 09:48
1 is byzantine, 2 received enough prepare messages to send a commit, but 3 and 4 didn't send commits yet

simon
2016-09-01 09:48
yes right

simon
2016-09-01 09:49
only if f+1 (or more) report that they sent a commit message (P-set), the request can have been committed

vukolic
2016-09-01 09:49
let me check sth

simon
2016-09-01 09:49
ok

vukolic
2016-09-01 09:53
ok the above should be the example

vukolic
2016-09-01 09:53
but the views reported by P1 and P2 should be different

vukolic
2016-09-01 09:53
and values should be different

vukolic
2016-09-01 09:54
so P1 could have prepared (v1,view1)

simon
2016-09-01 09:54
so P1:<v:2,seq:5,digest:123>, P2:<v:3,seq:5,digest:abc>

vukolic
2016-09-01 09:54
yes

simon
2016-09-01 09:54
and we're changing to view 4?

vukolic
2016-09-01 09:55
no because P2 might have been committed

vukolic
2016-09-01 09:55
ok, so it goes like this

vukolic
2016-09-01 09:55
view change for view 4

vukolic
2016-09-01 09:55
VP0: nothing

vukolic
2016-09-01 09:55
VP1: P1:<v:2,seq:5,digest:123>,

vukolic
2016-09-01 09:55
VP2: P2:<v:3,seq:5,digest:abc>

vukolic
2016-09-01 09:56
notice that VP0 saying nothing

vukolic
2016-09-01 09:56
does not mean it did not send PREPARE

vukolic
2016-09-01 09:56
it might have, but since you do not want a Q set you cannot tell here

vukolic
2016-09-01 09:56
so

vukolic
2016-09-01 09:56
you cannot wait for VP3 - might have crashed

vukolic
2016-09-01 09:57
and there is an execution like this in which there is no Byzantine VP

vukolic
2016-09-01 09:57
now you cannot just select P2 because VP2 might be lying
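
The disambiguation being described might be sketched like this: among conflicting reported P-set entries for a sequence number, only a digest vouched for by at least f+1 Q sets could actually have committed. This is a hedged reading of the example, not the pbft-paper algorithm verbatim:

```go
package main

import "fmt"

// selectDigest picks, among conflicting P-set digests reported for one
// sequence number, the one that appears in at least f+1 reported Q sets;
// f+1 Q-set appearances mean at least one correct replica pre-prepared it,
// which a lone (possibly lying) P-set report cannot establish on its own.
func selectDigest(pEntries []string, qCounts map[string]int, f int) (string, bool) {
	for _, d := range pEntries {
		if qCounts[d] >= f+1 {
			return d, true
		}
	}
	return "", false // no candidate vouched for: nothing can have committed
}

func main() {
	// VP1 reports digest "123", VP2 reports "abc"; "abc" shows up in 2 Q sets (f=1).
	d, ok := selectDigest([]string{"123", "abc"}, map[string]int{"abc": 2}, 1)
	fmt.Println(d, ok)
}
```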

vukolic
2016-09-01 09:58
going for lunch :slightly_smiling_face:

simon
2016-09-01 09:58
okay :slightly_smiling_face:

2016-09-01 14:21
@jeffgarratt has started a Google+ Hangout for this channel. https://hangouts.google.com/hangouts/_/r2yict7hmveczewfaqrry2bdwae.

kostas
2016-09-01 14:21
The way we implement P sets today, I cannot quite see how signed PREPAREs (https://hyperledgerproject.slack.com/archives/fabric-consensus-dev/p1472722972000063) would help.

jyellick
2016-09-01 15:36
@simon Do you think you could try to get your simplebft stuff into convergence? That way I can try to hook it into the orderer interface?

jyellick
2016-09-01 15:37
Also, not sure if you've followed https://jira.hyperledger.org/browse/FAB-50 , but the latest updates to Viper allow errors on spurious config options, so I think that it's a good solution to our config problems.

kostas
2016-09-01 15:39
I reviewed the simplebft branch a couple of days ago, and it's still WIP. (Unless work has been done in the interim which hasn't been pushed.)

jyellick
2016-09-01 15:39
I think WIP is fine? Especially as we have this feature branch, I think we should be more on the 'commit early, commit often' strategy?

jyellick
2016-09-01 15:40
And, @kostas how far from pushing to Gerrit are you on Kafka?

kostas
2016-09-01 15:40
Works for me. Was more of a heads up that it's not ready yet, in case you hadn't looked at the code.

jyellick
2016-09-01 15:41
I have looked at the code, saw a lot was unhandled, thought maybe happy path was working though

kostas
2016-09-01 15:42
I got back today, I'll write tests during the next couple of days and then I'll push.

simon
2016-09-02 13:48
i pushed a first version of view change

simon
2016-09-02 13:48
it's always the most ugly code

simon
2016-09-02 13:54
but overall it is in a state that we can now add a lot of tests to make sure that it works

simon
2016-09-02 13:54
deterministic tests!

simon
2016-09-02 13:54
yey

simon
2016-09-02 13:55
@jyellick right now i let the new primary forfeit its position if it doesn't have a request that would be proposed in the new-view

simon
2016-09-02 13:56
i figured that would be easier than doing a fetch-and-retry
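
The forfeit rule simon describes could look roughly like this; all names here are illustrative, not the simplebft code:

```go
package main

import "fmt"

// primary is an illustrative model of the incoming primary: if it lacks
// the payload of any request its NEW-VIEW must re-propose, it abdicates by
// view-changing again rather than doing a fetch-and-retry.
type primary struct {
	view  uint64
	store map[string][]byte // request payloads known to this replica, by digest
}

// sendViewChange stands in for broadcasting a VIEW-CHANGE for view v.
func (p *primary) sendViewChange(v uint64) { p.view = v }

// sendNewView returns false (after forfeiting to the next view) if any
// required request payload is missing from the local store.
func (p *primary) sendNewView(required []string) bool {
	for _, digest := range required {
		if _, ok := p.store[digest]; !ok {
			p.sendViewChange(p.view + 1) // forfeit the primaryship
			return false
		}
	}
	// ... broadcast the NEW-VIEW message here ...
	return true
}

func main() {
	p := &primary{view: 4, store: map[string][]byte{"abc": nil}}
	fmt.Println(p.sendNewView([]string{"abc", "123"}), p.view) // forfeits: "123" is missing
}
```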

jyellick
2016-09-02 13:57
Seems sensible, especially for now. Would need to think on potential attacks via that, for a byzantine node to seize leadership at will

simon
2016-09-02 13:57
yea

simon
2016-09-02 13:58
i think i'll switch to test writing now

simon
2016-09-02 13:59
i wonder what happens if the primary sends an invalid pre-prepare

simon
2016-09-02 13:59
say the primary sent pp for 3, now sends one for 5

simon
2016-09-02 14:00
how do i tell the difference from a network outage

simon
2016-09-02 14:00
i think i need some transport layer information

simon
2016-09-02 14:01
no, i don't think that would help

simon
2016-09-02 14:01
i could be disconnected, and reconnect without losing any messages

simon
2016-09-02 14:01
so how do i tell that the primary is faulty

simon
2016-09-02 14:02
i guess timeout?

jyellick
2016-09-02 14:04
On reconnect, I think you would need to do timeout

jyellick
2016-09-02 14:04
Although, I still think ultimately we should come up with a network handshake

simon
2016-09-02 14:05
i guess that primary fault would be handled by a request timeout

jyellick
2016-09-02 14:05
Yes

simon
2016-09-02 14:09
hm, right now i just have a timeout starting from preprepare

simon
2016-09-02 14:09
i guess i should remove that and instead have individual timeouts for all requests received

simon
2016-09-02 14:09
i.e. not per batch, but per request

simon
2016-09-02 14:11
again, the big question, is the view change timer reset when a correct new view message is received, or when the next request commits?

jyellick
2016-09-02 14:27
I never liked the need to commit a request after the new view. I would say the view change timer resets when a correct new view message is received. Then, if there are outstanding requests, the outstanding request timer starts.

jyellick
2016-09-02 14:28
Since we are exploiting ordering, I can go dig up my old work on the request queue stuff. I think the answer is to have a per client (in our case, replica) request timer

simon
2016-09-02 14:29
ah i see

grapebaba
2016-09-05 13:32
hi, i have a question about the new architecture. Currently the consensus service delivers to peers using gossip, right?

grapebaba
2016-09-05 13:35
whether endorsing peer, submitting peer, or committing peer?

muralisr
2016-09-05 14:19
@grapebaba how about we deal with the question on fabric-peer-dev that was created for the purpose of dealing with endorsement/commitment so we can give that channel some advertisement ?

grapebaba
2016-09-05 14:23
oh

grapebaba
2016-09-05 14:24
i did not know about that channel

muralisr
2016-09-05 14:25
the topic is relevant here too… fabric-peer-dev was created recently for discussing the new architecture's non-consensus peers (endorsement/commitment). Just wanted to give it some airtime :slightly_smiling_face:

simon
2016-09-05 14:25
why are there so many channels?

hgabor
2016-09-05 14:40
the more the better

muralisr
2016-09-05 14:41
why indeed … I think categorizing is good in general

muralisr
2016-09-05 14:41
not sure about more the better :slightly_smiling_face:

simon
2016-09-05 14:41
but we're like 10 developers

hgabor
2016-09-05 14:43
@muralisr in general you are right but 5-8 categories would be enough :slightly_smiling_face: I always write into dev-env and CI if I have a problem, then into testing... and then fabric-dev remains :smile:

simon
2016-09-05 14:43
we just need one place where everybody is and everything gets coordinated

simon
2016-09-05 14:43
yea good luck phasing out all of these channels

hanhzf
2016-09-06 05:12
has joined #fabric-consensus-dev

csehd
2016-09-06 08:36
has joined #fabric-consensus-dev

hgabor
2016-09-06 08:43
@csehd has some interesting errors when bombing fabric with transactions

csehd
2016-09-06 08:44
Moment pls. I'm trying to reproduce the phenomenon

csehd
2016-09-06 09:28
Hy everyone. I have a bug, maybe related with pbft consensus.

csehd
2016-09-06 09:29
I have 4 validating peer with pbft consensus

simon
2016-09-06 09:29
which code version?

csehd
2016-09-06 09:29
aug 19


csehd
2016-09-06 09:30
I will try it with the latest master version as well

csehd
2016-09-06 09:31
I have a workload generator, which stress all peer at the same time at a constant 50 invokes/sec/peer via REST API chaincode endpoint

simon
2016-09-06 09:32
with or without security?

csehd
2016-09-06 09:32
without security

simon
2016-09-06 09:32
what's your batch size set to?

csehd
2016-09-06 09:33
500

simon
2016-09-06 09:33
okay

simon
2016-09-06 09:33
so what happens?

csehd
2016-09-06 09:33
oh. I'm using the example02 chaincode

csehd
2016-09-06 09:34
At the end of the measurement, the ledger has duplicate transactions

csehd
2016-09-06 09:34
so the ledger state shows the future

simon
2016-09-06 09:34
yea, the example02 chaincode does not protect against replays

csehd
2016-09-06 09:35
is it the chaincode's role to protect from the duplicates?

simon
2016-09-06 09:36
that's debatable

simon
2016-09-06 09:37
something needs to protect against replays

ikocsis
2016-09-06 09:37
has joined #fabric-consensus-dev

ikocsis
2016-09-06 09:37
hi all

ikocsis
2016-09-06 09:39
Simon, that means then that the fabric is "allowed" to retry requests and it is the express duty of the chaincode to recognize whether a duplicated attempt is being made?

ikocsis
2016-09-06 09:39
Or am I misunderstanding what you are saying?

simon
2016-09-06 09:39
i don't think this is defined anywhere

simon
2016-09-06 09:39
so it is wild west

simon
2016-09-06 09:39
if you use utxo, there are no successful replay attacks

simon
2016-09-06 09:40
if you don't - you better think about which replay attacks you want to prevent

ikocsis
2016-09-06 09:42
The weird thing is - I think we don't do replays; a bunch of requests go in (each getting an ID, and we record these), with no duplication, and still we see duplicated instances of these txIDs in the ledgers

ikocsis
2016-09-06 09:42
So the "replay attack", if you like, is a courtesy of the consensus

simon
2016-09-06 09:42
yes

ikocsis
2016-09-06 09:42
Nice

simon
2016-09-06 09:43
but you will have to protect against it anyways

ikocsis
2016-09-06 09:43
Well, that is true, now that I think about it

simon
2016-09-06 09:43
let's say consensus was completely correct

simon
2016-09-06 09:43
but one of the peers was byzantine

simon
2016-09-06 09:43
it could just duplicate your request

ikocsis
2016-09-06 09:44
yep, right

hgabor
2016-09-06 09:44
in this case these were non-byzantine peers, and @csehd sent in every tx just once. isn't it a bug this way?

ikocsis
2016-09-06 09:46
hgabor, yes, it would still be nice to know what the heck happens in this specific case consensus-wise

simon
2016-09-06 09:46
yes, there are bugs

deeflorian
2016-09-06 09:46
has joined #fabric-consensus-dev

simon
2016-09-06 09:47
the problem is that in the original pbft, the client waits until its request is processed

ikocsis
2016-09-06 09:47
but this means that right now we are not working against the assumptions that fabric makes wrt the chaincode

simon
2016-09-06 09:47
and then sends another one, sequentially

simon
2016-09-06 09:47
and if it takes too long, it probably re-sends its request, under the assumption that the network lost it

simon
2016-09-06 09:48
the protocol prevents this replay from being accepted by means of a sequence number per client

simon
2016-09-06 09:48
in fabric, the peers act as "client"

simon
2016-09-06 09:48
so then you have the choice between sending one request per peer, sequentially

simon
2016-09-06 09:48
effectively ruining performance

simon
2016-09-06 09:49
or hoping that requests will be processed sequentially (optimistically sending in multiple)

simon
2016-09-06 09:49
which leads to requests being skipped because of message reordering

simon
2016-09-06 09:50
or accepting that some requests duplicate by mistake

simon
2016-09-06 09:50
we have an idea of how to address this, by introducing virtual clients for every peer

simon
2016-09-06 09:50
but that has not been implemented, and likely won't be implemented in the current code
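
The per-client sequence-number guard simon describes can be sketched as follows (a toy illustration, not the fabric implementation). It also shows the trade-off he mentions: with optimistic pipelining, a reordered request arrives with a lower number and gets skipped.

```go
package main

import "fmt"

// seqTracker implements the PBFT-style per-client replay guard: each
// client tags requests with a monotonically increasing sequence number,
// and a replica accepts a request only if its number is higher than the
// last one accepted from that client.
type seqTracker struct {
	last map[string]uint64
}

func newSeqTracker() *seqTracker {
	return &seqTracker{last: map[string]uint64{}}
}

// accept returns true for fresh requests and false for replays
// (or reordered duplicates) from the same client.
func (s *seqTracker) accept(client string, seq uint64) bool {
	if seq <= s.last[client] {
		return false
	}
	s.last[client] = seq
	return true
}

func main() {
	t := newSeqTracker()
	fmt.Println(t.accept("vp0", 1)) // true: first request
	fmt.Println(t.accept("vp0", 1)) // false: replay rejected
	fmt.Println(t.accept("vp0", 3)) // true: optimistically pipelined
	fmt.Println(t.accept("vp0", 2)) // false: reordered request gets skipped
}
```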

ikocsis
2016-09-06 09:54
Simon, thanks - personally I have to process this a bit; the takeaway is that chaincode02 has to be fixed for protection against replay.

hgabor
2016-09-06 09:54
as I remember none of the examples is replay protected

hgabor
2016-09-06 09:55
they are just toys

ikocsis
2016-09-06 09:55
Is there any documentation that summarizes such requirements on the chaincode? (And maybe patterns for fulfilling them.)

simon
2016-09-06 09:56
no

ikocsis
2016-09-06 09:56
It's no problem that they are toys, but a big warning sign that "weird things will happen if you use this" would be nice :slightly_smiling_face:

hgabor
2016-09-06 09:57
we should add one

simon
2016-09-06 09:57
haha

hgabor
2016-09-06 09:58
btw one can use GetTxByTxID or something to check if the txid is already used
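
hgabor's suggestion amounts to the chaincode recording each transaction ID it has applied and rejecting repeats. A minimal sketch against a hypothetical key-value world state (the real shim API differs; names here are made up for illustration):

```go
package main

import (
	"errors"
	"fmt"
)

// state is a stand-in for the chaincode's key-value world state.
type state map[string][]byte

// invokeOnce applies fn only if txid has not been seen before, then
// records it under a reserved key prefix, so that a consensus-level
// duplicate of the same transaction becomes a no-op.
func invokeOnce(st state, txid string, fn func(state) error) error {
	key := "seen/" + txid
	if _, dup := st[key]; dup {
		return errors.New("duplicate transaction: " + txid)
	}
	if err := fn(st); err != nil {
		return err
	}
	st[key] = []byte{1}
	return nil
}

func main() {
	st := state{}
	transfer := func(s state) error { s["balance/B"] = []byte("10"); return nil }
	fmt.Println(invokeOnce(st, "tx-42", transfer)) // first invocation succeeds
	fmt.Println(invokeOnce(st, "tx-42", transfer)) // duplicate is rejected
}
```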

simon
2016-09-06 09:59
this whole project doesn't have defined behaviour or design

deeflorian
2016-09-06 10:01
still i think we can agree that it'd be weird if a request dropped in through the rest api got duplicated

hgabor
2016-09-06 10:02
but somehow one got duplicated :slightly_smiling_face:

deeflorian
2016-09-06 10:03
and delegating the responsibility of TxID checks to the chaincodes themselves seems a bit hackish

deeflorian
2016-09-06 10:04
...i'm not that up to date on the current status of TxID generation, but there were a few mails on the list saying that the TxID should be generated from the hash of the payload

hgabor
2016-09-06 10:04
yes and it can be generated from it

hgabor
2016-09-06 10:04
I mean from the payload

deeflorian
2016-09-06 10:04
that'd contain the TS too?

hgabor
2016-09-06 10:05
TS?

csehd
2016-09-06 10:05
timestamp

deeflorian
2016-09-06 10:05
(timestamp on the api receiver end)

hgabor
2016-09-06 10:05
currently not

hgabor
2016-09-06 10:05
currently it does not contain that

ikocsis
2016-09-06 10:05
@simon, we see that (@deeflorian spent quite a bit of quality time with the code base and our as of now in-house model is still far from complete)

deeflorian
2016-09-06 10:06
yepp, but if it's a feature in a work-in-progress or a to-be-done state, then the "bug" will resolve itself

hgabor
2016-09-06 10:09
there was a debate on the list about what we should include in that hash, and some people said we should include a timestamp and/or a nonce. so yes, if we make that happen then it will be ok

deeflorian
2016-09-06 10:09
though if i'm wrong, the _big warning sign_ is a nice temporary fix but further discussion might be needed

hgabor
2016-09-06 10:09
but as I remember we agreed that hashing will only be modified in fabric new architecture

simon
2016-09-06 10:11
i'm happy to discuss possible solutions

deeflorian
2016-09-06 10:11
that'd still mean that this is just a temporary feature in a pre-1.0 release... kind of a relief :slightly_smiling_face:

hgabor
2016-09-06 10:12
@deeflorian what do you think about including a nonce?

simon
2016-09-06 10:12
and this needs to be part of the architecture discussion

csehd
2016-09-06 10:13
you'd have to account for the clock drift between peers

simon
2016-09-06 10:14
what are we talking about?

csehd
2016-09-06 10:14
I'm talking about including the nonce in the txid generation.

simon
2016-09-06 10:15
well but how does that help

simon
2016-09-06 10:16
you'd need a O(1) database to check for existing transactions

simon
2016-09-06 10:16
which needs quite some space

hgabor
2016-09-06 10:16
@simon how do we define if two transactions are the same?

simon
2016-09-06 10:17
yes :slightly_smiling_face:, that's how it starts

simon
2016-09-06 10:17
which replays do you want to prevent

deeflorian
2016-09-06 10:17
would give it a thumbs up on my part, but i think this is something that should be discussed at an architectural meeting? (and checking the generated nonces for possible duplicate messages should then be implemented at the consensus level?)

ikocsis
2016-09-06 10:18
guys, I have to bail (something urgent came up) - @csehd, @deeflorian: I am happy with any true solution, however hackish it may be

deeflorian
2016-09-06 10:19
...based on the last comments, this might even involve the requirements wg :smile:

hgabor
2016-09-06 10:19
@deeflorian I just realized some minutes ago that the duplication happens on consensus level so the issue may be harder, or am I wrong?

csehd
2016-09-06 10:21
@hgabor You are right. In my case, it is on Consensus level

hgabor
2016-09-06 10:21
@simon case1: I only want to filter duplicates that arise from multiple invocations of chaincodes with the same parameters (or... is that a duplicate?) case2: I also want to filter consensus-level duplicates. what can I do?

simon
2016-09-06 10:22
well, case1 includes case2

deeflorian
2016-09-06 10:23
i'd guess so (duplication in the execution queue or further down that pipe would be much harder for the other peers to deal with)

hgabor
2016-09-06 10:23
@simon it includes it, because if it comes from the consensus, that will generate a call to the chaincode, right?

deeflorian
2016-09-06 10:25
@hgabor on case1, I'd say that unless the CC also requires a unique id / ts from the client (which would be weird), the same parameters are quite probable

deeflorian
2016-09-06 10:25
(tx from A to B in a bank, same amount, same comment)

hgabor
2016-09-06 10:28
@deeflorian yes they are, and others say the same. however I/we think that in a production system it is much more likely that the bitcoin technique will be used: inputs. those easily prevent duplicates

hgabor
2016-09-06 10:28
inputs and outputs, some kind of chain, you know what I mean

simon
2016-09-06 10:29
well, but if you do utxo, you already have replay protection

hgabor
2016-09-06 10:30
yes that is what I say :smile:

hgabor
2016-09-06 10:30
and you keep it lightweight

hgabor
2016-09-06 10:30
no need for general protection, you have a domain specific one

simon
2016-09-06 10:30
also with the new architecture, the chaincode is executed first

simon
2016-09-06 10:31
and then the result goes through consensus

hgabor
2016-09-06 10:31
and the changeset is obtained

simon
2016-09-06 10:31
and that one will have replay protection just by using MVCC

hgabor
2016-09-06 10:32
is it an option not to use the mvcc?

simon
2016-09-06 10:32
no

hgabor
2016-09-06 10:32
@deeflorian @csehd I don't want to hijack the topic :slightly_smiling_face:

simon
2016-09-06 10:33
so that reduces the problem to "proposal" replays before going to the endorser
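
The MVCC protection simon refers to works roughly like this: the endorsed result carries the versions of every key it read, and commit succeeds only if those versions are still current, so a replayed (or conflicting) changeset fails the second time. A toy sketch, not the fabric ledger code:

```go
package main

import "fmt"

// kv is a versioned value: the version bumps on every write.
type kv struct {
	val     string
	version uint64
}

type store map[string]kv

// readSet records the key versions a simulated execution observed.
type readSet map[string]uint64

// commit applies the writes only if every read version is still current —
// the MVCC check that makes a replayed changeset fail on the second attempt.
func commit(st store, reads readSet, writes map[string]string) bool {
	for k, v := range reads {
		if st[k].version != v {
			return false // stale read: reject (replay or conflict)
		}
	}
	for k, v := range writes {
		st[k] = kv{val: v, version: st[k].version + 1}
	}
	return true
}

func main() {
	st := store{"A": {"100", 1}}
	reads := readSet{"A": 1}
	writes := map[string]string{"A": "90"}
	fmt.Println(commit(st, reads, writes)) // true: first commit applies
	fmt.Println(commit(st, reads, writes)) // false: replay sees a stale version
}
```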

deeflorian
2016-09-06 10:35
not a problem :slightly_smiling_face: what matters is that consensus level duplicates should not affect the ledger state, and this is just a temporary bug which will be then fixed as a side effect of the new architecture

hgabor
2016-09-06 10:36
by that you mean that a tx is proposed multiple times (and endorsed)?

deeflorian
2016-09-06 10:36
nope, that part of the sentence was related to the current state

deeflorian
2016-09-06 10:38
if it's proposed _and_ endorsed multiple times...

hgabor
2016-09-06 10:38
I mean simon's question :slightly_smiling_face:

deeflorian
2016-09-06 10:42
the reduction to proposal replay? i think that was a statement :smile:

hgabor
2016-09-06 10:43
how can such a proposal replay be prevented?

deeflorian
2016-09-06 10:44
not sure yet, i'll jump to the sidelines on the topic and dig a bit deeper into the new arch

simon
2016-09-06 10:52
hgabor: yes

bryan-huang
2016-09-06 10:52
has joined #fabric-consensus-dev

simon
2016-09-06 10:53
hgabor: well, i don't know. the crypto people have been asking about this

simon
2016-09-06 10:53
you can easily protect against it using utxo in the chaincode, or something similar

simon
2016-09-06 10:53
:slightly_smiling_face:

deeflorian
2016-09-06 11:21
just as a sidenote -- we've discussed the possible solutions for functionally correct measurements and will go with a custom CC that filters the duplicates, for the time being (much like the marbles example)

simon
2016-09-06 12:16
okay

simon
2016-09-06 14:34
oh our scrum now is an hour early

tuand
2016-09-06 14:35
cancelled today

simon
2016-09-06 14:35
oh

simon
2016-09-06 14:35
okay

simon
2016-09-06 15:59
jyellick: around?

simon
2016-09-06 15:59
or @kostas

jyellick
2016-09-06 15:59
I am

jyellick
2016-09-06 15:59
Kostas should be available later, but is busy at the moment (@simon)

simon
2016-09-06 16:00
hi

simon
2016-09-06 16:00
so i got view change working sufficiently well

simon
2016-09-06 16:00
and i'd like to move on to retaining state and state transfer

simon
2016-09-06 16:00
also, i really like the tendermint bft

jyellick
2016-09-06 16:01
With regards to retaining state, I'm currently working on a 'rawledger' interface

jyellick
2016-09-06 16:03
As a first step, I've factored out the simple ram based ledger I did for Solo, and am wrapping it in more sane interfaces, the thought being once it is there, I can quickly hack something with actual data persistence (though ultimately, we should come up with a non-hacky solution)

jyellick
2016-09-06 16:04
Then, I figured I could take the new rawledger stuff and hook it into your simplebft work

jyellick
2016-09-06 16:04
Last week when I'd looked, it seemed like it wasn't quite ready for it, so I figured hacking off something that already existed and I was familiar with (solo) made sense

simon
2016-09-06 16:14
yes

simon
2016-09-06 16:14
man that stupid raw ledger

simon
2016-09-06 16:14
i still feel it is the wrong interface

simon
2016-09-06 16:15
but oh well

simon
2016-09-06 16:15
i'd validate transactions during consensus, and reject invalid ones

jyellick
2016-09-06 16:25
It all really comes down to semantics. We say there's an unvalidated ledger, and a validated one, but in reality, we just have two blockchains, with different validity constructs. We've chosen to create one chain, and use it to build another, but, you could do this an arbitrary number of times. It's really just arbitrary transformations of one chain into another. It so happens the first chain is very stupid/simple and contains arbitrary bytes, while the second chain deals with MVCC stuff and other validity hooks.

simon
2016-09-06 16:25
yes exactly

kostas
2016-09-06 16:44
@simon: are you clear on the need for Q sets now?

kostas
2016-09-06 16:45
(Going back to last week's convo with Marko.)

simon
2016-09-06 16:47
yes

simon
2016-09-06 16:48
right now i can explain the need for them

simon
2016-09-06 16:48
but i think it is no longer a set, but a single item

kostas
2016-09-06 16:48
Do you want take a crack at an example that shows the need for them?

simon
2016-09-06 16:51
yes, there is a unit test for it

simon
2016-09-06 16:52
brb to tell you how it works

simon
2016-09-06 16:58
ok


simon
2016-09-06 17:00
kostas: so there are 3 scenarios, where R1 is prepared for seqno 1, then a view change occurs, and then another primary who didn't hear about this pre-prepare (or did) will act differently

kostas
2016-09-06 17:01
Alright, so I'm looking at all of the tests in that file then?

simon
2016-09-06 17:10
yes

simon
2016-09-06 17:16
and you can see that the p-sets are the same, but the q-sets are different

kostas
2016-09-06 17:18
Roger, will review.

simon
2016-09-06 17:24
ok

simon
2016-09-07 14:14
so now i have a queue of "future" messages (not limited yet), which allows nodes that have a slow link to still succeed

simon
2016-09-07 14:15
it is a bit crude tho, subject to DoS

oiakovlev
2016-09-07 18:48
has joined #fabric-consensus-dev

ynamiki
2016-09-08 01:39
has joined #fabric-consensus-dev

lin
2016-09-08 04:26
has joined #fabric-consensus-dev

lbonniot
2016-09-08 07:21
has joined #fabric-consensus-dev


jyellick
2016-09-08 14:21
@simon The reason why I do not support signing checkpoints, and therefore only periodically signing batches/blocks is that the intermediate batches/blocks have absolutely no value until the checkpoint comes through

simon
2016-09-08 14:22
yes

simon
2016-09-08 14:22
i agree

jyellick
2016-09-08 14:22
If we only sign checkpoints, then I think checkpoint should equal batch/block, which I actually like.

jyellick
2016-09-08 14:24
It seems like commits would still have value, if each commit carried up to batchSize/K messages

simon
2016-09-08 14:24
but then i have the problem that if i catch up to a checkpoint, i might be out of date and cannot continue

simon
2016-09-08 14:24
i need to wait for the next upcoming checkpoint

jyellick
2016-09-08 14:25
Certainly it adds complexity to the protocol, no doubt about it. Since we are building a hash chain anyway, checkpointing is basically free. We could simply combine commit/checkpoint into a single message and sign that.

jyellick
2016-09-08 14:26
The advantage to keeping them as distinct messages, is that a checkpoint message today guarantees that the block has actually been committed, whereas a commit says nothing of the sort.

jyellick
2016-09-08 14:26
But it obviously requires the 4th phase

simon
2016-09-08 14:27
yea, you only need f+1 checkpoints

simon
2016-09-08 14:27
signed checkpoints

jyellick
2016-09-08 14:33
I am more thinking of the problem we had with Sieve. The primary would send out a message with signed proof (essentially commit messages) from 2f+1 replicas. For a replica which needed to do state transfer, it had to gamble as to whether or not the replica it chose had actually committed that block yet or not.

simon
2016-09-08 14:44
yes

simon
2016-09-08 14:44
indeed

simon
2016-09-08 14:44
so we keep the 4th phase

simon
2016-09-08 14:47
so i guess i need to modify my batch definition to include the prev batch hash

simon
2016-09-08 14:47
and then if a primary proposes a batch with incorrect prev batch hash, it is considered byzantine

jyellick
2016-09-08 14:49
Sounds correct to me
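
Linking each batch to its predecessor's digest, as simon proposes, can be sketched like this (sha256 stands in for whatever digest the real batch format uses; the types are illustrative). A replica would treat a primary whose proposal fails the link check as byzantine.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// batch carries the digest of its predecessor, forming a hash chain.
type batch struct {
	prevHash []byte
	payload  []byte
}

// digest hashes a batch's predecessor link together with its payload.
func digest(b batch) []byte {
	h := sha256.New()
	h.Write(b.prevHash)
	h.Write(b.payload)
	return h.Sum(nil)
}

// validNext checks the hash-chain link; a primary proposing a batch that
// fails this check would be considered byzantine.
func validNext(prev, next batch) bool {
	return bytes.Equal(next.prevHash, digest(prev))
}

func main() {
	genesis := batch{payload: []byte("genesis")}
	good := batch{prevHash: digest(genesis), payload: []byte("batch 1")}
	bad := batch{prevHash: []byte("bogus"), payload: []byte("batch 1")}
	fmt.Println(validNext(genesis, good)) // true
	fmt.Println(validNext(genesis, bad))  // false
}
```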

shiseki
2016-09-09 05:04
has joined #fabric-consensus-dev

simon
2016-09-09 10:55
sorry guys, i got sick last night - can't concentrate or work today

hgabor
2016-09-09 10:59
get well soon :slightly_smiling_face:

deeflorian
2016-09-09 14:17
ehm... guys... any major changes happened to consensus lately? we've put together a CC that protects against self-made "replay attacks" (essentially it simulates users A and B, A has a giant deck of cards, and transactions are actions where a single card is given to B). we use a heavier workload (few hundred tx/s/peer), and at a given time, things look like this in the pbft module:

deeflorian
2016-09-09 14:17

deeflorian
2016-09-09 14:17
after a view change:

deeflorian
2016-09-09 14:17

deeflorian
2016-09-09 14:21
which looks like a scenario where the primary is already in the next view, but it looks like it's still performing its tasks (small number of rejects, many transactions are written to the ledger). and this happens during the entirety of the experiment... kind of dazzled at the moment... and what's more, the CC on the core peer is dormant during the period when the other peers think it's the primary (~0% CPU usage from a CC that generates a quite heavy workload)

deeflorian
2016-09-09 14:22
not saying that a change _has_ happened, but this seems like a fascinating phenomenon :smile:

jyellick
2016-09-09 14:39
@deeflorian No real changes to PBFT have made it in lately. That is pretty baffling behavior. In general, all leader batch sends are wrapped at the very least in a check against `isActiveView`. Would love to see some logs with PBFT debugging enabled if you are able to reproduce.

jyellick
2016-09-09 14:40
My best guess would be that vp1 successfully sent the new view message, but the other peers were not able to process it quickly enough, leading `vp1` to view change again. You would see some transactions processed, and then it would sit in this state until there were new outstanding transactions, which should cause another view change. If this is the case, then I would expect that the view change timeout and perhaps request timeout should be tuned up.

deeflorian
2016-09-09 15:11
thanks a lot, that's reassuring and constructive :slightly_smiling_face: maybe we're just overloading the execution/chaincode part of the system too heavily (with this CC; we could go a lot higher with other CCs, we just used this one with a few-week-old version)... based on what you said, i'd _guess_ that the outstanding queue might be too large and cause the primary to request a view change almost instantly, while the others "catch up" and process the sent-out batches. will do a few experiments and report what's found, but this might be too much of a Friday afternoon to explore all of the options now. this is definitely reproducible though, so I'll link the log when available

jyellick
2016-09-09 15:14
Great, thanks @deeflorian !

deeflorian
2016-09-09 16:19
@jyellick looks like this is only the result of melting down the pbft with requests :smile: didn't look at the outstanding queue size previously, but using the same stress level with a slower CC is too much. still, this means that longer CC execution times can lead to quite strange behaviour, observable in the pbft variables... interesting :slightly_smiling_face:

jyellick
2016-09-09 17:11
@deeflorian Ah, yes, that could definitely have an effect. If your executions are taking a long time, then I would recommend increasing the request timeout from the default of 2s to something like, maybe 10s.

rajeshsubhankar
2016-09-10 05:41
has joined #fabric-consensus-dev

rafael
2016-09-10 16:44
has joined #fabric-consensus-dev

vikas.singh
2016-09-11 05:57
has joined #fabric-consensus-dev

simon
2016-09-12 09:11
still sick today :confused:

garisingh
2016-09-12 09:54
sorry @simon . get better soon

tuand
2016-09-12 13:08

vukolic
2016-09-12 13:23
another shiny consensus protocol?

vukolic
2016-09-12 13:28
I'd like to see all these new protocols published properly in appropriate research venues

vukolic
2016-09-12 13:28
just like is the case with Paxos, Raft, PBFT and others

cca
2016-09-12 15:15
@tuand This and many others remind me a lot of past discussions on crypto "snake oil" (read Schneier's post, it has a longer version of what I'm writing here, https://www.schneier.com/crypto-gram/archives/1999/0215.html): Someone claims to have a superb new algorithm but fails to explain or formally demonstrate to the experts in the field how it is superior. There is a common theme: cryptography and resilient protocols have to withstand a class of attacks; a protocol or cryptosystem may run *much* faster if there is no attack, but it's hard to demonstrate which attacks/situations it survives. Essentially this requires mathematical arguments. Unlike, say, designing a faster network, which demonstrates its merit by operating it and measuring the speed. In crypto and consensus protocols you can only demonstrate the *failure* to achieve a claimed goal, and such an attack is hard work. Therefore the protocol has to be peer-reviewed just like a cryptosystem. Otherwise just don't bother.

ganesh47
2016-09-12 17:08
has joined #fabric-consensus-dev


grapebaba
2016-09-13 12:06
these two documents have some inconsistencies

grapebaba
2016-09-13 12:06
the diagram in the pdf does not have the submitting peer

grapebaba
2016-09-13 12:07
which one is the latest design?

grapebaba
2016-09-13 12:12
@grapebaba uploaded a file: https://hyperledgerproject.slack.com/files/grapebaba/F2B32L70C/gossip_fabric_v4.pptx and commented: also the Next-Consensus-Architecture-Proposal.md doesn't seem to show the gossip communication

grapebaba
2016-09-13 12:15
can we have these designs in one guide?

vukolic
2016-09-13 13:13
@grapebaba good point - synchronization of these two documents is on the way - will post here (and elsewhere) as soon as the consolidated design is available

vukolic
2016-09-13 13:13
thanks for the patience

garisingh
2016-09-13 16:31
@jyellick - you around?

jyellick
2016-09-13 16:32
I am

garisingh
2016-09-13 16:32
so trying to figure out getting the change sets you posted for solo and selecting ledger interface reviewed and merged

jyellick
2016-09-13 16:33
I saw Kostas had some comments, was going to address them

jyellick
2016-09-13 16:33
But need to get those fixes from over the weekend into 0.5 and master for 0.6 today


garisingh
2016-09-13 16:33
okay - makes sense

jyellick
2016-09-13 16:34
Ah, damn, I thought gerrit was smarter than that

jyellick
2016-09-13 16:34
I can rebase the patch series

garisingh
2016-09-13 16:35
no rush - but looked like interesting code for people to start using :wink:

jyellick
2016-09-13 16:36
Thanks, at least the very first blush solo stuff is in there, so people can at least explore an implementation of the proto api, but splitting out the ledger implementations should ultimately help us hook in the work the ledger crew is doing as well as give us something for pbft to hook into now that it's no longer in the peer

jyellick
2016-09-13 18:18
@kostas @garisingh @tuand @simon @sanchezl @jeffgarratt I've submitted the fixes found this weekend to master to hopefully get in before the 0.6 cut, https://gerrit.hyperledger.org/r/#/c/1039/ https://gerrit.hyperledger.org/r/#/c/1041/ The patches didn't apply as cleanly as I'd hoped, so please review carefully

garisingh
2016-09-13 18:31
sketchy code hehe

jyellick
2016-09-13 19:24
(Just posted an update to patch 2 which substantially cleans up the behave stuff)

simon
2016-09-14 13:00
so i wanted to work on the persistence code for sbft

simon
2016-09-14 13:00
but i can't really concentrate

simon
2016-09-14 13:00
just blankly staring at the screen

garisingh
2016-09-14 13:06
is it staring back?

wlahti
2016-09-14 13:42
has joined #fabric-consensus-dev

simon
2016-09-16 09:10
this got stuck in a private chat:

simon
2016-09-16 09:11
so i think we need to persist:
1. last checkpoint certificate
2. last request
3. the fact that we last sent a prepare
4. the fact that we last sent a commit
5. most recent "execute", i.e. persisted block
i don't know whether we need 6. the last checkpoint message we sent ourselves, because we can always reproduce the checkpoint message. so (1) allows others to sync to us, (2) allows the network to restart after a crash during a round, (3) is the Q set, (4) is the P set, (5) is the blockchain/app state

simon
2016-09-16 09:57
2+3 is actually last sent/received pre-prepare

garisingh
2016-09-16 10:08
@simon - is this for the "ordering" nodes? (Might have missed the beginning of the chat)

simon
2016-09-16 10:08
yes

simon
2016-09-16 10:09
for the simplified pbft rewrite


simon
2016-09-16 10:09
this

garisingh
2016-09-16 10:09
so would 5) really be the last thing that the ordering node "broadcast"?

simon
2016-09-16 10:10
oh this numbering is arbitrary and just refers to state that the replica needs to persist across restarts

simon
2016-09-16 10:10
the sequence is

simon
2016-09-16 10:10
request -> primary

simon
2016-09-16 10:10
primary: preprepare

simon
2016-09-16 10:11
everybody else: prepare

simon
2016-09-16 10:11
everybody: commit

simon
2016-09-16 10:11
everybody: checkpoint

simon
2016-09-16 10:11
repeat

simon
2016-09-16 10:11
so it is a 4 phase protocol with one signed message
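
For n = 3f+1 replicas, the quorum sizes behind this flow can be sketched as follows (matching the earlier point that f+1 signed checkpoints suffice as proof, while prepare/commit needs a full 2f+1 quorum):

```go
package main

import "fmt"

// quorums returns the relevant thresholds for a PBFT-style protocol
// with n = 3f+1 replicas tolerating f byzantine faults.
func quorums(f int) (n, commitQuorum, checkpointProof int) {
	n = 3*f + 1             // total replicas
	commitQuorum = 2*f + 1  // needed to prepare/commit a batch
	checkpointProof = f + 1 // signed checkpoints: at least one from a correct replica
	return
}

func main() {
	n, cq, cp := quorums(1)
	fmt.Println(n, cq, cp) // 4 3 2
}
```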

garisingh
2016-09-16 10:12
right - so 1), 2), 3), 4) in your list make sense. I just was not sure what you meant by "most recent execute"

simon
2016-09-16 10:12
so after i receive a quorum of commits (or checkpoints, to be debated), i "execute", i.e. add the block to the local chain (app state)

garisingh
2016-09-16 10:13
okay - cool. makes sense.

simon
2016-09-16 10:13
most recent execute is the same as "highest block"

garisingh
2016-09-16 10:17
and would we store this separate to storing whatever block history we decide? So that if I simply backed up these pieces of information (for example if the actual hardware running the node died) I could restart the node on another machine with this info and rejoin?

simon
2016-09-16 10:20
yes

kostas
2016-09-16 10:21
(By the way, I'm still investigating this. I'm not getting the proper block height upon restarting with the transferred state, so I'm checking with Manish whether there is a proper way to shutdown the good peer so that its memory contents are captured to disk.)

simon
2016-09-16 10:21
although backup is difficult, because different files are backed up at different times

simon
2016-09-16 10:21
hi kostas - early morning?

simon
2016-09-16 10:21
kostas: of the peer?

kostas
2016-09-16 10:21
Hello - yes.

kostas
2016-09-16 10:22
Yes, I'm basically trying out this manual state transfer scenario that Gari alludes to.

kostas
2016-09-16 10:22
But the lastExec that the resurrected peer reports is not right.

simon
2016-09-16 10:22
is this different from killing the process and restarting?

kostas
2016-09-16 10:23
That's what I do but it doesn't work.

simon
2016-09-16 10:23
kostas: is that with pbft or with kafka?

kostas
2016-09-16 10:23
With PBFT. We're talking about the 0.5 (now: 0.6) branch.

kostas
2016-09-16 10:25
So in pbft-persist, we restore lastSeqNo and that info is not what I wanted it to be for VP3 (the peer that received the state of VP2).

kostas
2016-09-16 10:25
But if you stop and restart VP2, the right lastSeqNo is reported. So that means that the peer is shutdown properly.

simon
2016-09-16 10:30
ah

simon
2016-09-16 10:30
yes, it is entirely possible that there is a bug in that code path

simon
2016-09-16 10:31
how do you stop vp2?

simon
2016-09-16 10:31
is there a graceful stop option?

kostas
2016-09-16 10:32
There is and I'm trying this now.

simon
2016-09-16 10:35
lastSeqNo should come from the consensusmetadata field of the block?

kostas
2016-09-16 10:41
Correct.

kostas
2016-09-16 10:44
As for your earlier points w/r/t what needs to be persisted in your simplified BFT work, are you sure about #3? (Qset = _last_ fact we sent prepare)?

kostas
2016-09-16 10:44
That would imply that the Qset is a single-item list, whereas Figure 3 in the original Castro paper suggests you may well have <10, bar, 3> and <10, baz, 4>, in addition to your most recent <10, foo, 5> prepare. (<n, d, v> notation)

simon
2016-09-16 10:46
yes, but you asked me whether this applied

kostas
2016-09-16 10:46
Correct, and I concluded that this does apply.

simon
2016-09-16 10:47
right now we don't send more than one preprepare anyways

simon
2016-09-16 10:47
in the view change

simon
2016-09-16 10:47
|Q| <= 1

simon
2016-09-16 10:48
certainly there must be a bug in there

kostas
2016-09-16 10:48
Right now, as in the new sBFT work?

simon
2016-09-16 10:48
yes

kostas
2016-09-16 10:48
OK, I'm suggesting that this might need to be reconsidered.

simon
2016-09-16 10:49
can you define a scenario where this is required?

simon
2016-09-16 10:49
that would help a lot

simon
2016-09-16 10:49
my intuition is that it doesn't apply

simon
2016-09-16 10:49
because a view change resolves all requests

simon
2016-09-16 10:50
because there is just one

kostas
2016-09-16 12:25
I am going through the paper, and I think you're right. Condition A1 ensures that the primary selects a request (for pre-prepare in the new view) that some replica in a quorum claims to have prepared in the latest view, or else a null-request. Following that logic, you cannot have <10, bar> prepared in view 4 and <10, foo> pre-prepared in view 5; the latter implies that <10, foo> prepared in view 4, which contradicts <10, bar, 4> preparing.
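
To make the A1 condition concrete, here is a toy sketch (all type and function names invented, not from the fabric codebase) of how a new primary could pick the value to pre-prepare for one sequence number: take the digest that some replica reports as prepared in the highest view, or fall back to a null request.

```go
package main

import "fmt"

// PEntry is a (seqno, digest, view) tuple reported in a VIEW-CHANGE message,
// in the <n, d, v> notation used above. Illustrative names only.
type PEntry struct {
	Seq    uint64
	Digest string
	View   uint64
}

// selectForSeq sketches condition A1: for a given sequence number, pick the
// request that a replica claims to have prepared in the highest view, or a
// null request if nobody prepared anything at that seqno.
func selectForSeq(seq uint64, reported []PEntry) string {
	var best PEntry
	found := false
	for _, e := range reported {
		if e.Seq != seq {
			continue
		}
		if !found || e.View > best.View {
			best, found = e, true
		}
	}
	if !found {
		return "null-request"
	}
	return best.Digest
}

func main() {
	reported := []PEntry{
		{Seq: 10, Digest: "bar", View: 4},
		{Seq: 10, Digest: "foo", View: 3},
	}
	fmt.Println(selectForSeq(10, reported)) // bar: prepared in the latest view
	fmt.Println(selectForSeq(11, reported)) // null-request: nothing prepared
}
```

This is only the selection rule; the quorum-collection and correctness proof live in the paper.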

simon
2016-09-16 12:58
maybe @vukolic has a better explanation

vukolic
2016-09-16 13:01
guys

vukolic
2016-09-16 13:01
I read

vukolic
2016-09-16 13:01
but failed to get a TL DR

vukolic
2016-09-16 13:01
can smbd pls summarize?

vukolic
2016-09-16 13:03
as for "you cannot have <10, bar> prepared in view 4 and <10, foo> pre-prepared in view 5"

vukolic
2016-09-16 13:03
actually you can

kostas
2016-09-16 13:03
Can you give me a sequence that would result in this?

vukolic
2016-09-16 13:03
sure

vukolic
2016-09-16 13:04
primary sends <PRE-PREPARE,10,bar> to all

kostas
2016-09-16 13:04
I was about to write: "...which brings up the question, why on earth do we keep a list of items in the Qset in the original PBFT paper." (There must be something I'm missing.)

vukolic
2016-09-16 13:04
all send <PREPARE,10,bar>to all

vukolic
2016-09-16 13:04
but only primary receives PREPARES

vukolic
2016-09-16 13:04
so primary prepares <10,bar>

vukolic
2016-09-16 13:04
after a complete network breakdown

vukolic
2016-09-16 13:04
new leader is elected

vukolic
2016-09-16 13:04
vp1

vukolic
2016-09-16 13:05
but now old leader (say vp0) is partitioned

vukolic
2016-09-16 13:05
it cannot report 10,bar from view 4

vukolic
2016-09-16 13:05
and new primary proposes whatever he wants

vukolic
2016-09-16 13:05
which is 10,foo in view 5

vukolic
2016-09-16 13:05
QED :slightly_smiling_face:

vukolic
2016-09-16 13:07
as for Qsets

kostas
2016-09-16 13:07
Do we agree that all the other nodes participating in view 5 have <10, bar, 4> in their Qset?

vukolic
2016-09-16 13:07
yes in this example

vukolic
2016-09-16 13:07
I could have a simpler one

vukolic
2016-09-16 13:07
in which they don't

vukolic
2016-09-16 13:08
at least not all of them

kostas
2016-09-16 13:08
In your example above, wouldn't the new leader then assign the null-request to seqNo 10?

vukolic
2016-09-16 13:08
but only one

vukolic
2016-09-16 13:09
remind me - are null requests coming from PBFT paper or is this our own invention?

kostas
2016-09-16 13:09
PBFT paper.

vukolic
2016-09-16 13:09
pointer?

kostas
2016-09-16 13:10
Sure, pg. 412 of the TOCS version, last paragraph. (pg. 15 of the PDF)

vukolic
2016-09-16 13:14
conf call starting will come back

vukolic
2016-09-16 13:15
ok so yes - there is null in this version

vukolic
2016-09-16 13:16
you are right so <10,foo> is not possible but <10,no-op> is

kostas
2016-09-16 13:16
Exactly, thanks.

vukolic
2016-09-16 13:17
so what is the optimization you propose?

kostas
2016-09-16 13:18
So if Simon is doing the no-op thing in his SBFT work, then can we claim that he doesn't need a list for the Qset? He'll only be storing a single item there.

vukolic
2016-09-16 13:18
was the list because of watermarks?

kostas
2016-09-16 13:18
(Basically, there's no point in keeping <10, bar, 3> and <10, baz, 4> around.)

kostas
2016-09-16 13:20
To answer that, I'll have to be convinced of the use of a list for the Qset in the PBFT paper to begin with. (Short answer right now: I don't really know.)

kostas
2016-09-16 13:20
I remember your conversation with Simon here a couple of weeks ago, but I don't think that example was fully worked through; at least it didn't make sense to me.

vukolic
2016-09-16 13:39
ok

vukolic
2016-09-16 13:39
back

vukolic
2016-09-16 13:40
so the question is: without watermarks, should the Q set be a single value?

vukolic
2016-09-16 13:40
and not a set?

simon
2016-09-16 13:41
yes

simon
2016-09-16 13:41
correct

kostas
2016-09-16 13:41
I claim that if you do the null-request thing, it's definitely a single value.

vukolic
2016-09-16 13:41
ok so there are two things

vukolic
2016-09-16 13:41
even without watermarks

vukolic
2016-09-16 13:41
one could have sth called pipelining

vukolic
2016-09-16 13:42
in which I as a leader

vukolic
2016-09-16 13:42
send PRE_PREPARE for seqno=10

vukolic
2016-09-16 13:42
but I do not wait for that to commit to start seqno=11

vukolic
2016-09-16 13:42
I just send the PRE-PREPARE for seqno=11

vukolic
2016-09-16 13:42
and so on

vukolic
2016-09-16 13:43
(this looks like watermarks - but it is not)

vukolic
2016-09-16 13:43
namely

vukolic
2016-09-16 13:43
on the reception side - pipelining mandates that followers process requests in order

vukolic
2016-09-16 13:43
but also in pipeline

vukolic
2016-09-16 13:44
like to send PREPARE for seqno=12

vukolic
2016-09-16 13:44
replica would need to send PREPARE for seqno=11

vukolic
2016-09-16 13:44
but not commit seqno=11

vukolic
2016-09-16 13:44
you see what I mean

vukolic
2016-09-16 13:44
so you eliminate watermarks

kostas
2016-09-16 13:44
(With you so far.)

vukolic
2016-09-16 13:44
but still have "full pipe"

vukolic
2016-09-16 13:45
in this case you would still need Q to be a set

vukolic
2016-09-16 13:45
but the set is there only because there are multiple requests in flight (albeit pipelining is diff from watermarks)

vukolic
2016-09-16 13:45
now

vukolic
2016-09-16 13:45
if the question is

vukolic
2016-09-16 13:46
how many values are there in the Q set that have the same sequence number and replica ID

vukolic
2016-09-16 13:46
the answer is always - at most one

vukolic
2016-09-16 13:46
if the question is

vukolic
2016-09-16 13:46
how many values are there in the Q set that have the same sequence number - the answer is again a set

vukolic
2016-09-16 13:46
because due to Byzantine leader

kostas
2016-09-16 13:47
Are these answers pipeline-specific only?

vukolic
2016-09-16 13:47
different replicas can have different values

vukolic
2016-09-16 13:47
no

vukolic
2016-09-16 13:47
after "now" there is nothing pipeline specific

vukolic
2016-09-16 13:47
so strictly speaking

vukolic
2016-09-16 13:47
Q set at the leader

vukolic
2016-09-16 13:48
when it decides how to select a value for a given seqNo, must be a set, because you can report <kostas, 10, foo> and me <marko,10,bar>

vukolic
2016-09-16 13:48
and both of us are correct

vukolic
2016-09-16 13:48
because the leader was Byz

vukolic
2016-09-16 13:49
I realize now that pipelining does not matter for this argument - but it is good we had it, because I wanted to tell you guys anyway that this is diff from watermarks
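
The at-most-one-entry-per-(seqno, replica) point above, versus possibly several digests per seqno across replicas, can be sketched like this (hypothetical types, not fabric code):

```go
package main

import "fmt"

// Index collected Qset entries by (seqno, replicaID): each slot holds at most
// one value, but a single seqno can still map to different digests reported
// by different replicas if the old leader was Byzantine.
type qKey struct {
	Seq     uint64
	Replica string
}

type qSet map[qKey]string // digest last pre-prepared by that replica at that seqno

// report overwrites any earlier entry for the same (seqno, replica): a replica
// only ever vouches for its latest pre-prepare at a given seqno.
func (q qSet) report(seq uint64, replica, digest string) {
	q[qKey{seq, replica}] = digest
}

// digestsAt collects the distinct digests reported for one seqno.
func (q qSet) digestsAt(seq uint64) map[string]bool {
	out := map[string]bool{}
	for k, d := range q {
		if k.Seq == seq {
			out[d] = true
		}
	}
	return out
}

func main() {
	q := qSet{}
	q.report(10, "kostas", "foo")
	q.report(10, "marko", "bar") // Byzantine leader sent different pre-prepares
	q.report(10, "marko", "bar") // duplicate report collapses into one slot
	fmt.Println(len(q.digestsAt(10))) // 2 distinct digests for seqno 10
}
```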

vukolic
2016-09-16 13:49
:slightly_smiling_face:

simon
2016-09-16 13:51
ah yes

simon
2016-09-16 13:51
so the Qset in the viewchange is a single item

kostas
2016-09-16 13:51
Right, do we agree on that?

vukolic
2016-09-16 13:51
in viewchange msg

simon
2016-09-16 13:51
but the new primary needs to be able to deal with Qsets that refer to different digests

vukolic
2016-09-16 13:51
yes

kostas
2016-09-16 13:52
Exactly.

simon
2016-09-16 13:52
yep, we have that

vukolic
2016-09-16 13:52
but only w/o pipelining and w/o watermarks

simon
2016-09-16 13:52
yep

vukolic
2016-09-16 13:52
if you have pipelining

vukolic
2016-09-16 13:52
it is again a set

simon
2016-09-16 13:52
i think we need to have a simple working bft first

vukolic
2016-09-16 13:52
but with no two values being the same for the same seqno

vukolic
2016-09-16 13:53
we are talking about VIEW-CHANGE msg only

vukolic
2016-09-16 13:53
for Fig 3 logic at the leader

vukolic
2016-09-16 13:53
it is always a set

simon
2016-09-16 13:54
and my expectation is that batching helps so much, and the rest of the system is so slow anyways

vukolic
2016-09-16 13:57
you can do things without pipelining if you want testing first

vukolic
2016-09-16 13:58
but eventually we will want pipelining (but not watermarks) - so just have that in the back of the mind

simon
2016-09-16 13:58
every extra conditional makes it so much harder to reason about

vukolic
2016-09-16 13:59
pipelining should not be difficult - it is just not blocking on the commit

vukolic
2016-09-16 13:59
but you have two sequence numbers

vukolic
2016-09-16 13:59
commit

vukolic
2016-09-16 13:59
and process

vukolic
2016-09-16 13:59
you always process process-seqno + 1

vukolic
2016-09-16 13:59
but process-seqno does not have to be commit-seqno + 1
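
A minimal illustration of the two-cursor pipelining described above (invented names; the real protocol involves messages and quorums, this only shows the cursor discipline):

```go
package main

import "fmt"

// A follower keeps two cursors: it agrees to PREPARE for process-seqno+1
// without waiting for commit to catch up, so several requests are in flight,
// but everything is still handled strictly in order.
type follower struct {
	processSeq uint64 // highest seqno we have sent a PREPARE for
	commitSeq  uint64 // highest seqno we have committed
}

// canPrepare enforces in-order processing: only the next seqno in the pipe.
func (f *follower) canPrepare(seq uint64) bool {
	return seq == f.processSeq+1
}

func (f *follower) prepare(seq uint64) {
	if f.canPrepare(seq) {
		f.processSeq = seq
	}
}

func main() {
	f := &follower{}
	f.prepare(1)
	f.prepare(2)
	f.prepare(3) // prepared 1..3 while nothing is committed yet
	fmt.Println(f.processSeq, f.commitSeq) // 3 0: process runs ahead of commit
}
```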

simon
2016-09-16 14:02
well let's leave that for later :slightly_smiling_face:

simon
2016-09-16 14:03
my basic persistence stuff seems to be sort of working

simon
2016-09-16 14:03
at least in my tests i can restart a node and it can participate in the network again

oiakovlev
2016-09-16 18:33
Some silly question, which was partially raised in general (by another person), but after re-thinking I want to re-ask it here. What does consensus actually mean or guarantee us: that all peers will have the same world state, right? But can we imagine that a chaincode just stores a random number, say using the parameter as a key? I believe that different peers will try to store different values to the KVS (as they are random), right? But the transactions are valid on all peers, so they will be accepted by consensus. On the other hand, different peers end up with different values in the KVS. I realize that this is a very corner-case example but... still, where am I wrong? I mean, am I wrong in saying that the world state should be the same on all nodes and that this is guaranteed by consensus, or will consensus handle such cases somehow? And more than that, such transactions are non-deterministic - so is it now the responsibility of the developer to write deterministic ones, or is there some 'protection'?

garisingh
2016-09-16 18:34
currently the responsibility of the developer to write deterministic code

oiakovlev
2016-09-16 18:38
makes sense, just double checking.. if we use some sort of Oracle to get data from the 'external world' and the result from the Oracle can be date/time dependent (exchange rates, for example), then replaying the ledger might lead to different results... At least if HL has some sort of such service, usage of it should be deterministic as well - for example, pass the date/time or whatever makes it deterministic.. Sorry for thinking out loud...

vukolic
2016-09-17 00:50
@oiakovlev you may want to check out the architectural direction that fabric is taking


vukolic
2016-09-17 00:50
that architecture has a different approach to non-deterministic code

vukolic
2016-09-17 00:51
but it would still be expected from the developer to code deterministic chaincode

vukolic
2016-09-17 00:51
although fabric will give some protection

vukolic
2016-09-17 00:52
against the effects of non-deterministic chaincode

donovanhide
2016-09-18 15:40
has joined #fabric-consensus-dev

donovanhide
2016-09-18 19:31
Re-posting from #fabric: Have just been reading through the consensus docs, specifically the endorsement stage : https://github.com/hyperledger/fabric/blob/master/proposals/r1/Next-Consensus-Architecture-Proposal.md#23-an-endorser-receives-and-endorses-a-transaction Given that the transaction simulation stage runs on both the submitting peer and the endorsing peers and transactions are not broadcast as a batch, I’m wondering how this design will deal with highly contended keys in the value store. For example, say the ledger is used by the chaincode to hold an orderbook, with offers constantly changing at high frequency. It is likely that clients will be submitting to different peers at around the same time transactions that will modify the tip of the orderbook. My reading of the design document is that this will frequently lead to the `STALE_VERSION` endorsement being returned. Ripple’s approach to this issue is to group a set of transactions into a batch which are processed in a hard-to-predict deterministic order, and loose time constraints dictate which transactions get into which batch. I’d be very interested to hear any views on this potential issue :slightly_smiling_face: https://github.com/hyperledger/fabric/blob/master/proposals/r1/Next-Consensus-Architecture-Proposal.md#41-batch-and-block-formation discusses this a little, but it seems that the batching occurs after endorsement?

zhuang.wei.ming
2016-09-18 23:49
has joined #fabric-consensus-dev

simon
2016-09-19 08:17
donovanhide: correct

simon
2016-09-19 08:17
donovanhide: i think contention is contention. short of implementing field calls, this is unavoidable

donovanhide
2016-09-19 09:25
@simon Thanks for the response! Can you define what you mean by field calls? My question revolves around whether batching transactions together, and the batch itself is endorsed, would mean the probability of contention is reduced. Are you saying that hyperledger consensus would theoretically perform badly for use cases like shared orderbooks?

simon
2016-09-19 09:26
not inherently

simon
2016-09-19 09:26
you just need to write your code so that you don't stomp on the state of parallel transactions

simon
2016-09-19 09:26
field calls is if you tell your database "add 5 to this field", instead of doing the adding yourself
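
A toy contrast (invented names) between read-modify-write and the field-call style described above:

```go
package main

import "fmt"

// Instead of reading a balance and writing back the sum (which puts the old
// value in the read set and conflicts with concurrent updates), a field call
// records only the delta; deltas commute, so concurrent increments compose.
type store struct{ balance int }

// readModifyWrite is the contending style: two concurrent callers both read
// 100 and both write 105, and one update is lost.
func (s *store) readModifyWrite(delta int) { s.balance = s.balance + delta }

// A field call expresses the same change as "add delta to this field"; the
// database applies the deltas itself, so they serialize without conflict.
type delta int

func apply(s *store, deltas []delta) {
	for _, d := range deltas {
		s.balance += int(d)
	}
}

func main() {
	s := &store{balance: 100}
	apply(s, []delta{5, 5}) // two concurrent "+5" field calls both take effect
	fmt.Println(s.balance)  // 110
}
```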

donovanhide
2016-09-19 09:27
Do you have any ideas for an orderbook data structure that could handle that kind of parallel mutation?

simon
2016-09-19 09:27
what is an orderbook?

donovanhide
2016-09-19 09:28
An orderbook is a set of offers made by accounts to buy or sell an asset. It typically sees lots of activity at the best price as traders contend to be the best offer. So if you had offers as data type in the ledger, you’d probably also need an index to order them by price and time created.

donovanhide
2016-09-19 09:29
The index would see a lot of contention on popular orderbooks.

simon
2016-09-19 09:29
i don't think you should maintain an index

donovanhide
2016-09-19 09:30
How would you iterate the orderbook when a crossing offer comes in without one?

simon
2016-09-19 09:30
yes, that's difficult

simon
2016-09-19 09:30
you could do a database query

simon
2016-09-19 09:31
but a scan of the table means that you produce a readset

simon
2016-09-19 09:31
maybe @chetsky has a good idea

donovanhide
2016-09-19 09:34
It’s probably worth researching Ripple a bit to get some ideas of potential issues: https://ripple.com/build/ledger-format/#offer Ripple has a Directory node type in the ledger which is a linked list of pointers to offers. Contention on them is reduced by processing/endorsing multiple transactions at a time.

donovanhide
2016-09-19 09:36
One possible solution is that if indexes are stored externally to the ledger and don’t alter the world state hash, but are automatically updated when a qualifying data type changes, then you could have a very efficient system. General purpose indexes are hard though :slightly_smiling_face:

simon
2016-09-19 09:39
yes

donovanhide
2016-09-19 09:39
The index would also have to be accessible from chaincode.

simon
2016-09-19 09:39
i think as little as possible should be part of the chaincode

simon
2016-09-19 09:39
no, it doesn't

simon
2016-09-19 09:40
imagine this:

simon
2016-09-19 09:40
you have an application, and it consumes the incoming list of (purchase/sell) offers

simon
2016-09-19 09:40
now you want to perform a purchase

simon
2016-09-19 09:40
so you pick a matching sell that you like (maybe you don't want to trade with specific entities)

simon
2016-09-19 09:41
and then you formulate a "match existing sell with this purchase" transaction

simon
2016-09-19 09:41
the chaincode just checks whether that sell is still available, and endorses the transaction

simon
2016-09-19 09:42
or, if the sell expired or was consumed by somebody else, the chaincode does not endorse, you receive an error (basically you lost a race), and then you retry

simon
2016-09-19 09:42
no contention

simon
2016-09-19 09:42
does that make sense?
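
The endorse-or-retry flow sketched above could look roughly like this in chaincode terms (all names hypothetical; this is not the fabric chaincode API):

```go
package main

import (
	"errors"
	"fmt"
)

// The client picks a specific sell off-chain and submits a "match" transaction;
// the chaincode only checks that the chosen sell still exists, endorsing if it
// does and rejecting (so the client retries) if someone else consumed it.
type state map[string]string // sellID -> serialized offer

var errLostRace = errors.New("sell no longer available, retry with another")

func matchSell(s state, sellID, buyer string) error {
	if _, ok := s[sellID]; !ok {
		return errLostRace
	}
	delete(s, sellID) // consume the sell; the write set records the deletion
	return nil
}

func main() {
	s := state{"sell-42": "200 USD/EUR @ 1.30"}
	fmt.Println(matchSell(s, "sell-42", "alice")) // <nil>: endorsed
	fmt.Println(matchSell(s, "sell-42", "bob"))   // bob lost the race, retries
}
```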

donovanhide
2016-09-19 09:43
So you are suggesting storing the set of buy and sell offers as an unordered bag in the ledger, and the event stream feeds additions and deletions to that set, the client maintains the ordering and selects specific offers to attempt to consume?

donovanhide
2016-09-19 09:44
That works until a buy offer is submitted which is higher than an existing sell offer. Also offers may consume one *or more* existing offers. So it does get complicated quickly.

donovanhide
2016-09-19 09:47
The question fundamentally boils down to random access vs sequential access of entities in the ledger. If sequential access is required an index will be contended due to concurrent, but one by one, transaction processing.

donovanhide
2016-09-19 09:48
Maybe it’s not a huge issue if all nodes are close to each other on a network, but if they are geo-disparate the network latency will amplify the contention.

simon
2016-09-19 09:50
no, don't maintain an index

simon
2016-09-19 09:50
imagine how you would do this on bitcoin

simon
2016-09-19 09:51
you'd parse specific transactions to see which ones are sell or buy offers, and then you'd create a transaction that matches

simon
2016-09-19 09:52
what you are describing is an inherently contending application

donovanhide
2016-09-19 09:52
I think what you are saying is just do the trade settlement in hyperledger and store the orderbook externally?

simon
2016-09-19 09:53
no, you can store the orderbook in hyperledger

simon
2016-09-19 09:53
oh, maybe you could do this:

simon
2016-09-19 09:53
you partition offers and matching

simon
2016-09-19 09:54
you have one chaincode (or a section), which records the sequence of offers (buy and sell)

simon
2016-09-19 09:54
hm, sequence is a problem with the current architecture

simon
2016-09-19 09:55
which means that we need to reify a primitive that exposes the total order broadcast nature of consensus

simon
2016-09-19 09:56
if that primitive existed, you would have a defined order, and you could deterministically perform the matching

simon
2016-09-19 09:56
and then run this matching through another chaincode transaction

donovanhide
2016-09-19 09:57
Well, just skimmed your paper, I think "Execute-then-order” with speculative execution might not perform well on contended resources, compared to "Order-then-execute”. Ripple chooses the latter. I strongly think that an orderbook example would be a great benchmarking testcase for hyperledger. It’s a fun, but difficult problem :slightly_smiling_face:

simon
2016-09-19 09:58
order-then-execute is what we have right now

donovanhide
2016-09-19 09:58
External indexes is one possible solution. Ripple made the mistake of internalising them, which uses a huge amount of data storage.


simon
2016-09-19 09:59
the problem is that hyperledger uses go as implementation language for chaincode

simon
2016-09-19 09:59
which means people keep implementing non-deterministic code

simon
2016-09-19 09:59
which just breaks order-then-execute systems

donovanhide
2016-09-19 10:00
You can write deterministic code in Go, just have to make sure all your inputs are deterministic :slightly_smiling_face:

simon
2016-09-19 10:00
nope

donovanhide
2016-09-19 10:00
Really?

simon
2016-09-19 10:00
not only that

simon
2016-09-19 10:00
maps are non-deterministic

simon
2016-09-19 10:00
memory addresses are non-deterministic

donovanhide
2016-09-19 10:00
Don’t use maps :slightly_smiling_face:

simon
2016-09-19 10:01
it is easy to have global state

donovanhide
2016-09-19 10:01
Ordered slices!
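
For reference, the standard workaround for Go's randomized map iteration, extract the keys, sort them, then range, looks like this:

```go
package main

import (
	"fmt"
	"sort"
)

// Go randomizes map iteration order on purpose, so chaincode that ranges over
// a map can write state in a different order on every replica. Sorting the
// keys first makes the iteration deterministic.
func sortedKeys(m map[string]int) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	balances := map[string]int{"carol": 3, "alice": 1, "bob": 2}
	for _, k := range sortedKeys(balances) {
		fmt.Println(k, balances[k]) // always alice, bob, carol
	}
}
```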

simon
2016-09-19 10:01
my experience is that in 100% of the cases where people said "something is wrong with consensus, the network just stops", it actually was caused by non-deterministic chaincode

simon
2016-09-19 10:01
you need to be an expert programmer to do it right

simon
2016-09-19 10:02
and still you might get it wrong

simon
2016-09-19 10:02
and then your whole network stops

simon
2016-09-19 10:02
it's a terrible DoS vector

donovanhide
2016-09-19 10:03
Well, maybe drawing up some coding guidelines and providing some deterministic random access data structures (OrderedSet and OrderedMap) would help.

simon
2016-09-19 10:03
but it wouldn't be able to rule out the problems

donovanhide
2016-09-19 10:04
Are you suggesting using a functional language instead? Or something more locked down like Solidity?

simon
2016-09-19 10:04
something designed to be deterministic

donovanhide
2016-09-19 10:05
Sounds like a research paper :slightly_smiling_face:

simon
2016-09-19 10:05
sounds like a solved problem

simon
2016-09-19 10:05
because, solidity

donovanhide
2016-09-19 10:05
But then, you are just rewriting Ethereum :slightly_smiling_face:

simon
2016-09-19 10:06
oh didn't you notice that hyperledger is just a copy of ethereum?

donovanhide
2016-09-19 10:06
:slightly_smiling_face:

donovanhide
2016-09-19 10:07
Still a bit confused by "order-then-execute” in current implementation. Is that linked document saying that the next version will be “execute-then-order”?

donovanhide
2016-09-19 10:08
Also, no one has successfully yet written a working distributed orderbook in Ethereum.

simon
2016-09-19 10:10
yes, you read the design of the next architecture

simon
2016-09-19 10:11
currently transactions come in, are ordered, and then every validating peer executes them in the same order

donovanhide
2016-09-19 10:11
There’s currently no checking of previous key values in the endorsement step?

donovanhide
2016-09-19 10:16
It’s difficult to summarise this, but I think the design choice of processing transactions individually and checking world state hashes and changed key/values after each execution, rather than grouping transactions by submission time and updating the world state hash after successful transactions have executed, will lead to some difficult contention issues.

donovanhide
2016-09-19 10:17
I’m not saying Ripple has done everything right, it’s just a question of whether similar use cases to Ripple can be served by hyperledger.

simon
2016-09-19 10:21
i think you're raising an interesting use case

donovanhide
2016-09-19 10:22
Well, banks like orderbooks :slightly_smiling_face:

simon
2016-09-19 10:22
they do?

donovanhide
2016-09-19 10:22
It’s the truth :slightly_smiling_face:

simon
2016-09-19 10:22
so far what i heard was that banks want to do the settlement

simon
2016-09-19 10:22
but the order matching happens elsewhere

donovanhide
2016-09-19 10:24
Well, there is a huge market for corporate client cash pooling, which involves moving funds, cross-currency from subsidiary accounts to primary accounts. For the cross-currency exchange to occur, an orderbook is needed. If the orderbook can be in the same system as the bank account balances, it can all run at the same tick and be atomic. Can’t be too detailed, but happy to discuss privately :slightly_smiling_face:

simon
2016-09-19 10:26
so this involves buy/sell offer matching?

simon
2016-09-19 10:26
or just keeping record of transactions

donovanhide
2016-09-19 10:26
Both.

donovanhide
2016-09-19 10:27
Basically, we have used Ripple extensively for testing, it has some major issues we’d like to address. Hyperledger is potentially a useful platform for authoring an alternative.

jamie.steiner
2016-09-19 10:27
I have a golang orderbook library - pretty small but performant. double auction market/price time priority. I haven't decided how I want to license it yet though

simon
2016-09-19 10:28
what major issues?

donovanhide
2016-09-19 10:29
Ability to submit a complete ladder of offers and to be able to update them in a reasonable amount of time. Hyperledger would allow us to write custom transactions, such as UpdateLadder, rather than submit 20 separate offers individually.

simon
2016-09-19 10:29
i see

donovanhide
2016-09-19 10:29
@jamie.steiner Would be interested in looking at what you’ve got!

simon
2016-09-19 10:30
would you say that if you had a way to order offers through one chaincode, the system wouldn't suffer from contention?

jamie.steiner
2016-09-19 10:30
I wrote it a while back - I doubt I'd be able to sell it as is - will think on it tonight.

simon
2016-09-19 10:30
i think it would have to percolate through the system twice

simon
2016-09-19 10:31
once to order, and a second time to confirm the matches

jamie.steiner
2016-09-19 10:32
in my experience, the matches can be returned from a call to process the order. the orderbook itself can be threadsafe using locks, but the order of entering the orders absolutely matters.

donovanhide
2016-09-19 10:33
Basically, what will happen is that you have multiple market makers all submitting offers to an orderbook as external prices change. If they do so at the same time, the contention will kick in and retries might dominate.

donovanhide
2016-09-19 10:33
@jamie.steiner are we talking about an orderbook running in hyperledger?

jamie.steiner
2016-09-19 10:33
I disagree - you should use a lock on the orderbook state, and just deal with contention that way

jamie.steiner
2016-09-19 10:35
the orderbook - in hyperledger or elsewhere - will need to process orders in order to achieve the same state. so if it's being run in shared state, the ordering of transactions needs to be determined prior to processing those orders

jamie.steiner
2016-09-19 10:35
the contention i am referring to is actually more about the case where multiple threads are dropping orders into the book. that may or may not apply in this situation.

simon
2016-09-19 10:35
byzantine distributed systems can't use locks

simon
2016-09-19 10:36
that just moves the contention elsewhere

jamie.steiner
2016-09-19 10:36
if you can guarantee ordering of transactions, you dont need the lock

donovanhide
2016-09-19 10:36
I think we’re talking about different things. This discussion is about how to implement an orderbook in hyperledger and index the offers without contending updates.

jamie.steiner
2016-09-19 10:36
I designed for different use case, but it should still work

simon
2016-09-19 10:36
i still maintain that indexing should happen elsewhere

jamie.steiner
2016-09-19 10:38
I don't understand the problem of indexing and contention?

donovanhide
2016-09-19 10:38
@simon I agree that the index is the nub of the problem. If it was possible to have a peer provide a API calls to update and range scan external indexes, that would be one possible solution.

donovanhide
2016-09-19 10:39
@jamie.steiner hyperledger consensus (at least the next version) intends to check previous key values during the endorsement stage. If multiple clients are submitting offer transactions at the same time to different peers, each peer might have different key values and mark the transaction as stale. The endorsement happens before the total order is created. That is my understanding anyway...
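
That reading of the versioned-read check can be sketched as follows (invented names; fabric's actual validation logic may differ):

```go
package main

import "fmt"

// Endorsement simulation records a read set of (key, version) pairs, and
// validation after ordering marks a transaction stale if any read key has
// moved on in the meantime. Two clients hitting the same hot key therefore
// race, and the second one fails.
type ledger map[string]uint64 // key -> committed version

type readSet map[string]uint64 // key -> version observed during simulation

func validate(l ledger, rs readSet) string {
	for k, v := range rs {
		if l[k] != v {
			return "STALE_VERSION"
		}
	}
	return "VALID"
}

func main() {
	l := ledger{"orderbook/top": 7}
	// Both transactions were simulated against version 7 of the hot key.
	txA := readSet{"orderbook/top": 7}
	txB := readSet{"orderbook/top": 7}
	fmt.Println(validate(l, txA)) // VALID
	l["orderbook/top"] = 8        // txA commits and bumps the version
	fmt.Println(validate(l, txB)) // STALE_VERSION: txB lost the race
}
```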

simon
2016-09-19 10:40
external = non-deterministic

simon
2016-09-19 10:40
index is just a performance thing

donovanhide
2016-09-19 10:40
not if the peer is responsible for the indexing?

simon
2016-09-19 10:40
why not perform the indexing and matchmaking in an outside application?

simon
2016-09-19 10:41
and use a chaincode to validate the matchmaking

donovanhide
2016-09-19 10:41
@simon how do you then ensure atomicity between the orderbook and the balances? You’re in interledger territory then :slightly_smiling_face:

jamie.steiner
2016-09-19 10:41
If multiple clients are submitting offer transactions at the same time to different peers, the order of those transactions certainly must be decided, prior to placing them in the orderbook.


jamie.steiner
2016-09-19 10:45
If I am understanding that correctly, it seems like the endorser is meant to execute the transaction right away against its own state before relaying it to other peers?

donovanhide
2016-09-19 10:45
@simon to validate external match-making, you’d still need to be able to range over the offers.

donovanhide
2016-09-19 10:45
@jamie.steiner That’s my understanding.

jamie.steiner
2016-09-19 10:45
that wont work

donovanhide
2016-09-19 10:45
Hence the discussion :slightly_smiling_face:

jamie.steiner
2016-09-19 10:46
the transactions need to be ordered by the consensus protocol - surely that's something it already provides for, no?

donovanhide
2016-09-19 10:46
*After* the endorsement(s).

jamie.steiner
2016-09-19 10:47
if that's not possible to change, then the actual state update should only occur after correct order is established.

jamie.steiner
2016-09-19 10:48
I dont think this problem is unique to orderbooks - it is basically the double spending problem, with more complicated state

simon
2016-09-19 10:49
donovanhide: i guess...

donovanhide
2016-09-19 10:49
Once transactions are endorsed, they can still fail during execution, as I understand it. It’s just the endorsement is a kind of “pre-filter"

donovanhide
2016-09-19 10:49
but one that won’t work well with contended resources.

jamie.steiner
2016-09-19 10:49
tendermint works this way too - it's not more useful than a spam filter

simon
2016-09-19 10:50
the endorsement 1) shows that the chaincode executed correctly, and 2) proves that the stakeholders agree with the transaction

simon
2016-09-19 10:51
yes, i agree that this design is not good

donovanhide
2016-09-19 10:51
Well, that’s progress :slightly_smiling_face: Good to identify issues early :slightly_smiling_face:

simon
2016-09-19 10:51
but i can't do anything about it

donovanhide
2016-09-19 10:53
@simon because the design is fixed?

simon
2016-09-19 10:54
yes, and my opinions do not influence the design

donovanhide
2016-09-19 10:54
Who is the design lead? Can I make a case to that person?

simon
2016-09-19 10:58
there is @garisingh and @binhn

donovanhide
2016-09-19 10:59
Cool, thanks! Will try and engender some further discussion.

simon
2016-09-19 11:00
great

jamie.steiner
2016-09-19 11:01
as an example of the problem: I have $100. I make two, otherwise completely valid transactions sending the whole amount - one to Bob, and one to Alice. I send them at the same time to different peers. Each executes them against their local state before socializing them. Hilarity does not ensue.

donovanhide
2016-09-19 11:03
@jamie.steiner I believe those transactions might succeed in the transaction simulation stage, but will fail in the ordered execution stage. It’s ok if some payments fail as they happen much less frequently than offer updates, which will happen all the time by multiple parties.

simon
2016-09-19 11:04
correct

simon
2016-09-19 11:04
one is a double spend attack

jamie.steiner
2016-09-19 11:04
more precisely, one of them will fail in ordered execution.

simon
2016-09-19 11:04
yes, one has to fail

jamie.steiner
2016-09-19 11:05
and one should succeed. if that's the case, why could the same procedure not apply to an orderbook?

jamie.steiner
2016-09-19 11:05
it's just more complicated state - but state that depends on order, all the same

simon
2016-09-19 11:06
because your example is an application bug or an attempt to exploit something

simon
2016-09-19 11:06
while an orderbook is defined by this behavior

donovanhide
2016-09-19 11:07
@jamie.steiner Because for an orderbook to be meaningful, it requires the offers are ordered. To maintain that ordered state, you need an index. If multiple offers affect the same orderbook at the same time, the index is a contended resource.

jamie.steiner
2016-09-19 11:07
i dont understand what you mean by index?

donovanhide
2016-09-19 11:08
Offer A 200 USD/EUR @ 1.3 Offer B 100 USD/EUR @ 1.31 Index maintains that Offer A is a better price than Offer B

jamie.steiner
2016-09-19 11:08
oh. I see. so you mean A's index is 0 and B's is 1

jamie.steiner
2016-09-19 11:08
or something

donovanhide
2016-09-19 11:09
Yep, the world state is just a big map of buckets. To put an orderbook in there, you need an index that points to all the offers in a useful order.

jamie.steiner
2016-09-19 11:10
I would suggest using a different data structure. I dont think a map or array is the best. I used a B-tree which has O(log n) inserts, and easy access to the top of the stack

jamie.steiner
2016-09-19 11:10
and maintains order
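
As a hedged illustration of the ordered-insert idea: jamie's library used a B-tree, but a sorted slice with `sort.Search` shows the same semantics in a few lines (the binary search is O(log n); the slice shift stays O(n), which is exactly the cost a B-tree removes):

```go
package main

import (
	"fmt"
	"sort"
)

// Offer is an illustrative sell offer; a real structure carries more fields.
type Offer struct {
	Price float64 // USD/EUR rate
	Size  float64
}

// Book keeps asks sorted best (lowest) price first.
type Book struct{ asks []Offer }

// Insert places an offer at its price position. sort.Search finds the slot
// in O(log n); the slice shift is O(n) -- a B-tree makes the whole insert O(log n).
func (b *Book) Insert(o Offer) {
	i := sort.Search(len(b.asks), func(i int) bool { return b.asks[i].Price > o.Price })
	b.asks = append(b.asks, Offer{})
	copy(b.asks[i+1:], b.asks[i:])
	b.asks[i] = o
}

// Best returns the top of book; no explicit index value is ever needed.
func (b *Book) Best() Offer { return b.asks[0] }

func main() {
	b := &Book{}
	b.Insert(Offer{Price: 1.31, Size: 100}) // Offer B
	b.Insert(Offer{Price: 1.30, Size: 200}) // Offer A, the better price
	fmt.Println(b.Best().Price)             // Offer A is now first
}
```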

donovanhide
2016-09-19 11:10
Well, if you have a btree that can be serialised into the hyperledger key value store, that would be interesting :slightly_smiling_face:

jamie.steiner
2016-09-19 11:10
the index is pretty irrelevant, as long as inserts are done in the correct order

jamie.steiner
2016-09-19 11:11
hmmm

donovanhide
2016-09-19 11:11
Perhaps we need a Merkle-B-Tree :slightly_smiling_face:

simon
2016-09-19 11:11
the index is just for quick access

jamie.steiner
2016-09-19 11:12
thing is you never actually need to know the index - you only ever need the best order (when executing) plus a guarantee that the stack remains ordered correctly when you are not executing (meaning just adding an order)

donovanhide
2016-09-19 11:13
You might need the second and third best offer if the incoming offer crosses and consumes more than one existing offer.

jamie.steiner
2016-09-19 11:13
after the first is consumed, the second best becomes the first

jamie.steiner
2016-09-19 11:13
so no

donovanhide
2016-09-19 11:13
So you are saying a linked-list?

jamie.steiner
2016-09-19 11:14
I also looked at using a skip list - but decided on the b-tree as ideal (in my opinion). linked lists have O(n) inserts

donovanhide
2016-09-19 11:15
Yep, there is all the theoretical data structure knowledge :slightly_smiling_face: The test is how to apply it to a merkle tree for efficient access and updating.

jamie.steiner
2016-09-19 11:15

donovanhide
2016-09-19 11:16
I have also used that package :slightly_smiling_face: How would you serialise it into the merkle tree so that is uncontended? If you put it into a single key, it would have terrible contention :slightly_smiling_face:

jamie.steiner
2016-09-19 11:19
I have thought about putting an orderbook in a consensus based system a fair bit, and unfortunately my conclusion is that because the state of the art in real financial systems pushes the boundaries of what is possible in centralized systems, we are probably already around the bend on being able to do it via consensus. Still, to answer your question, I would not even try to serialize the whole structure. why not just run consensus on the new state?

donovanhide
2016-09-19 11:21
Well, it is possible, because Ripple. You should examine the hoops jumped through to make it work. The key entity is the DirectoryNode LedgerEntry type: https://ripple.com/build/ledger-format/#directorynode

donovanhide
2016-09-19 11:22
If this was mapped to hyperledger, the DirectoryNode updates would be highly contended. Which is my key point.

donovanhide
2016-09-19 11:23
What is more, all DirectoryNode updates are persisted in the log of changed nodes, so it uses an insane amount of storage.

jamie.steiner
2016-09-19 11:24
why is it necessary though?

donovanhide
2016-09-19 11:24
An ephemeral and external index of ledger entries would be a cool feature.

donovanhide
2016-09-19 11:24
@jamie.steiner Because an orderbook needs to be ordered :slightly_smiling_face: Unless you can prove to me otherwise, I’ll leave the burden of proof with you :slightly_smiling_face:

jamie.steiner
2016-09-19 11:27
If I understand the argument correctly, you are saying that the entire orderbook state needs to be serialized and stored as a part of a merkle tree because it is otherwise difficult (impossible?) to maintain consensus about the order of elements within that orderbook state.

jamie.steiner
2016-09-19 11:29
I certainly agree that if you tried to store that whole structure as a value of a single key, it will end in tears

donovanhide
2016-09-19 11:31
@jamie.steiner Think about how you’d write the chaincode to process an incoming offer. You need to examine the existing offers on the other side of the book to see if it crosses. If it does you need to iterate that side of the book and remove the crossing offers and update the appropriate balances. If it doesn’t, you need to work out which position it takes on this side of the orderbook. Both operations require ranging over orderbooks. Without an index, or maybe a linked list, you’d have to range scan over the whole merkle tree.
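
The matching loop described above can be sketched as follows (illustrative types, not actual chaincode; assumes an incoming buy against a sorted ask side, so a crossing offer may consume more than one resting offer):

```go
package main

import "fmt"

type Offer struct {
	Price, Size float64
}

// match consumes resting asks (sorted best price first) against an incoming bid.
// It returns the filled size and the remaining asks.
func match(asks []Offer, bid Offer) (filled float64, rest []Offer) {
	remaining := bid.Size
	i := 0
	for i < len(asks) && remaining > 0 && asks[i].Price <= bid.Price {
		take := asks[i].Size
		if take > remaining {
			take = remaining
		}
		filled += take
		remaining -= take
		asks[i].Size -= take
		if asks[i].Size == 0 {
			i++ // offer fully consumed; drop it from the book
		}
	}
	return filled, asks[i:]
}

func main() {
	asks := []Offer{{1.30, 200}, {1.31, 100}} // best price first
	// An incoming buy at 1.31 for 250 crosses and consumes more than one offer.
	filled, rest := match(asks, Offer{Price: 1.31, Size: 250})
	fmt.Println(filled, len(rest), rest[0].Size) // 250 1 50
}
```

Note that after the first offer is consumed, the second best becomes the first, as jamie says; the walk only ever touches the top of the book.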

jamie.steiner
2016-09-19 11:38
I have no experience writing chaincode for hyperledger. I have written the exact code you describe, however, and did not use any index (other than 0) to do so. I think the real question is "where is the state relating to open orders persisted?" If the orders are persisted in the merkle tree, that can be done with something that helps order them (which would allow the correct state to be (re)constructed). Once the orderbook state is constructed, and as long as it can be reconstructed via individual transactions (orders) that have been stored in the merkle tree, I don't see that updating the actual state in a persistent fashion on the merkle tree is something that needs to be done.

jamie.steiner
2016-09-19 11:39
sure, it means old (filled) orders are there, but who cares - if you reconstruct the state correctly, you have the set of currently open orders

jamie.steiner
2016-09-19 11:39
and anyway you need the audit trail of what happened, which can also be constructed

donovanhide
2016-09-19 11:43
Well, there are a myriad of issues that come up when you start trying to map things into a merkle tree. Persistence (as in Bagwell) does mean that previous states are maintained forever. In Ripple this is done by hashing the contents of a value and using that as a key, and then using a radix tree to map those keys to actual indexes. In hyperledger, key versioning is used. An index (like the DirectoryNode) reduces access reads required. Say you want to change the worst offer of a large orderbook. It will be very slow using your method… It’s not simple. Intelligent people have suffered greatly trying to solve these problems :slightly_smiling_face:

jamie.steiner
2016-09-19 11:52
"Say you want to change the worst offer of a large orderbook." cancel and replace (you lose your place in line) is the standard procedure in many markets - though admittedly not government bonds, in which I have some experience. I chose not to assume it was required in my (toy) library. Point being, it's open for debate as to whether the state of an order should be allowed to change - apart from cancellation. That said, either deletion or update in a decent implementation must have a separate time associated with it, and I see no reason why processing it couldn't be O(log n), since you should always be able to find a given open order based on price/time using the insertion algorithm, because if it has a different price or time, it is, by definition, a different order. I don't have much knowledge of the ripple approach, but I'm interested, and will look into it more.

donovanhide
2016-09-19 11:58
Well, it has been an interesting discussion :slightly_smiling_face: In summary, unless someone can author a world-changing, merkle-tree friendly, DOS-proof orderbook data structure, it seems like the next hyperledger consensus design might have issues with index contention which it would be great to discuss further with @garisingh and others. Thanks for everyone’s time!

simon
2016-09-19 12:04
@jyellick, @vukolic: so with ecdsa signatures, one sbft cycle takes ~0.8ms for a one-node network, ~4ms for a 4-node network, and 640ms for an 80-node network

donovanhide
2016-09-19 12:05
Have you considered ed25519?


simon
2016-09-19 12:07
i'm just adding a test

simon
2016-09-19 12:07
i'd like to use ed25519, but my guess is other forces would want to use NIST stuff

donovanhide
2016-09-19 12:08
I know a chap from Intel made ECDSA a lot faster recently for Go 1.7 (I think). I had a lot of fun helping speed up ed25519 :slightly_smiling_face: https://github.com/agl/ed25519/commits/master

simon
2016-09-19 12:09
how much did that gain?

simon
2016-09-19 12:11
i gotta check out for a while - need to eat

donovanhide
2016-09-19 12:11
Can’t find the old benchmarks… Seem to remember it was nearly 2.5x faster and not far off the C implementation. djb’s assembly version was 2x faster than the C… Long time ago :slightly_smiling_face:

jamie.steiner
2016-09-19 12:32
I compared that implementation to P256 from go standard lib recently - seem to recall it was impressively quick.
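
A rough micro-benchmark along these lines, stdlib only (`ecdsa.SignASN1`/`VerifyASN1` need Go 1.15+; `crypto/ed25519` is stdlib since Go 1.13). Absolute numbers vary by machine; verification is the hot path when checking 2f+1 signatures per sbft cycle:

```go
package main

import (
	"crypto/ecdsa"
	"crypto/ed25519"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/sha256"
	"fmt"
	"time"
)

// benchECDSA times n P-256 signature verifications and returns the per-op cost.
func benchECDSA(n int) time.Duration {
	key, _ := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	digest := sha256.Sum256([]byte("batch payload"))
	sig, _ := ecdsa.SignASN1(rand.Reader, key, digest[:])
	start := time.Now()
	for i := 0; i < n; i++ {
		if !ecdsa.VerifyASN1(&key.PublicKey, digest[:], sig) {
			panic("ecdsa verify failed")
		}
	}
	return time.Since(start) / time.Duration(n)
}

// benchEd25519 times n Ed25519 signature verifications and returns the per-op cost.
func benchEd25519(n int) time.Duration {
	pub, priv, _ := ed25519.GenerateKey(rand.Reader)
	msg := []byte("batch payload")
	sig := ed25519.Sign(priv, msg)
	start := time.Now()
	for i := 0; i < n; i++ {
		if !ed25519.Verify(pub, msg, sig) {
			panic("ed25519 verify failed")
		}
	}
	return time.Since(start) / time.Duration(n)
}

func main() {
	fmt.Println("ecdsa P-256 verify/op:", benchECDSA(200))
	fmt.Println("ed25519 verify/op:   ", benchEd25519(200))
}
```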

simon
2016-09-19 14:41
@jyellick, @hgabor: do you think we should augment the `Deliver` function between sbft and app to include a sequence number?

simon
2016-09-19 14:42
it's not going to be the batch number, unless we can skip it

hgabor
2016-09-19 14:43
how will we generate the seq number?

simon
2016-09-19 14:58
sbft has an internal seq number

hgabor
2016-09-19 15:00
the advantage of this is not clear for me

simon
2016-09-19 15:04
crash restart

jyellick
2016-09-19 15:05
@simon We need to talk about how we do config, whether it's part of the `Deliver` chain or not

jyellick
2016-09-19 15:06
If it is, then we need to either a) modify the block structure, or b) define some sort of wrapping data structure for the data in the block

simon
2016-09-19 15:06
yes

jyellick
2016-09-19 15:06
The other thing that gets to be really tricky is pruning

jyellick
2016-09-19 15:06
What seems far easier would be to instead have a second chain only for config

simon
2016-09-19 15:06
i think it should use request and deliver

jyellick
2016-09-19 15:06
And checkpoint on both of them

jyellick
2016-09-19 15:07
That way, we never need to worry about pruning config, and, we don't have to modify the block or wrap the binary blobs

jyellick
2016-09-19 15:07
It would grow the checkpoint message slightly, but minimally relative to the overall message sizes

simon
2016-09-19 15:07
ah i see

simon
2016-09-19 15:08
but would you still use request/deliver?

jyellick
2016-09-19 15:08
Do you mean `Broadcast`/`Deliver`?

simon
2016-09-19 15:08
no

simon
2016-09-19 15:08
this is consensus internal

jyellick
2016-09-19 15:09
Ah, so `request` is an internal function to grab a batch?

simon
2016-09-19 15:09
sec

jyellick
2016-09-19 15:09
(And I would say, yes)

simon
2016-09-19 15:09
```
type Receiver interface {
	Receive(msg *Msg, src uint64)
}

type System interface {
	Send(msg *Msg, dest uint64)
	Timer(d time.Duration, t timerFunc) Canceller
	Deliver(batch [][]byte)
	SetReceiver(receiver Receiver)
	Persist(key string, data proto.Message)
	Restore(key string, out proto.Message) bool
	Sign(data []byte) []byte
	CheckSig(data []byte, src uint64, sig []byte) error
}
```

simon
2016-09-19 15:10
actually, it also implements Request

simon
2016-09-19 15:10
so that's missing info

jyellick
2016-09-19 15:10
Presumably `Request` is for state transfer?

simon
2016-09-19 15:11
```
func (s *SBFT) Request(req []byte) {
	s.broadcast(&Msg{&Msg_Request{&Request{req}}})
}
```

simon
2016-09-19 15:11
atomic `Broadcast` -> sbft `Request`

jyellick
2016-09-19 15:11
Ah, okay

jyellick
2016-09-19 15:12
So very much the `Request` from obcpbft

simon
2016-09-19 15:12
later, sbft calls `sys.Deliver`, which persists the raw chain and itself does the atomic `Deliver`

simon
2016-09-19 15:12
yes

jyellick
2016-09-19 15:13
So the basic problem is we want `Deliver` and persisting the sequence number to be atomic
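
A toy sketch of that atomicity requirement (illustrative, not the sbft code): write the batch and its sequence number as one record, so a crash restart can always recover the last executed seqno and never sees a delivered batch without it:

```go
package main

import "fmt"

// record couples a batch with its sequence number; persisting them together
// in a single append is what makes the pair crash-safe.
type record struct {
	Seq   uint64
	Batch [][]byte
}

// store is a stand-in for a durable log (file, kv store, ...).
type store struct{ records []record }

// deliver appends the (seqno, batch) pair as one unit before acking upward.
func (s *store) deliver(seq uint64, batch [][]byte) {
	s.records = append(s.records, record{Seq: seq, Batch: batch})
}

// restore returns the last executed seqno after a crash restart.
func (s *store) restore() (uint64, bool) {
	if len(s.records) == 0 {
		return 0, false
	}
	return s.records[len(s.records)-1].Seq, true
}

func main() {
	s := &store{}
	s.deliver(1, [][]byte{[]byte("tx1")})
	s.deliver(2, [][]byte{[]byte("tx2")})
	seq, ok := s.restore()
	fmt.Println(seq, ok) // resume from seqno 2
}
```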

simon
2016-09-19 15:13
i think we should use Request/Deliver from sbft to sequence the config

simon
2016-09-19 15:14
my feeling is that we need to change Deliver to pass out the signatures as well

jyellick
2016-09-19 15:15
I agree we could use `Request` to sequence the config, but I think it is dependent on how we would want to store the config

simon
2016-09-19 15:15
we could pass in a config flag with request

simon
2016-09-19 15:15
or some metadata

jyellick
2016-09-19 15:15
We could make the message use `oneof`

simon
2016-09-19 15:16
yes

simon
2016-09-19 15:16
we can even do that from the outside

simon
2016-09-19 15:16
and to sbft it is completely opaque

jyellick
2016-09-19 15:21
I am backtracking on this second chain in my head now, we should encode validation policy in the chain, and if we need to handle pruning and that, it would not be that much harder to handle it for the rest of config

simon
2016-09-19 15:24
i think i agree, but i'm not sure what you are saying

jyellick
2016-09-19 15:26
Sorry if I am scattered, in a room of about 20 people right now who are talking about bootstrapping

jyellick
2016-09-19 15:26
Basically, I am thinking that the bft network is going to need to retain some amount of the 'rawledger' thing that gets sent via `Deliver`

simon
2016-09-19 15:27
yes, for state transfer

simon
2016-09-19 15:27
and config

jyellick
2016-09-19 15:27
We should make sure we support pruning of this chain

jyellick
2016-09-19 15:27
This chain should be self validating, which means it should encode a validation policy (set of keys) into it

simon
2016-09-19 15:28
in the config

jyellick
2016-09-19 15:28
This means, if we want to support pruning, and retain the validation policy (and config), we're going to need to do something clever. Either periodically re-asserting it at a frequency greater than the pruning, or, something else?

simon
2016-09-19 15:29
why?

jyellick
2016-09-19 15:29
Well, if the config/validation is encoded in the chain, and we only retain... 10k blocks, after 10k blocks since the config changed, how do we do the validation? How do we know our config?

simon
2016-09-19 15:30
we write the config out separately as well

simon
2016-09-19 15:30
so that we don't have to parse the whole chain to find the latest config

jyellick
2016-09-19 15:30
But how do we validate the config? How do we know it hasn't changed since we were last up?

simon
2016-09-19 15:31
it's valid because it is stored in our store

simon
2016-09-19 15:31
we don't know whether it changed until we contact the network

simon
2016-09-19 15:32
the network can tell us the latest config

jyellick
2016-09-19 15:32
So this is some extra process of consenting on config?

simon
2016-09-19 15:32
no

simon
2016-09-19 15:33
when i come up, i say "hey, i just restarted. what's the last batch we're at, and what's the last config?"

jyellick
2016-09-19 15:33
And you require f+1 same config responses?

jyellick
2016-09-19 15:34
(I'd argue that is an extra process of consenting on config)

simon
2016-09-19 15:34
no, you only require one, because it contains the signatures required for config change

jyellick
2016-09-19 15:35
But you don't know if that's stale config

simon
2016-09-19 15:35
fine, then wait for f+1
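
The f+1 rule can be sketched as follows (illustrative: with at most f byzantine replicas, any answer reported by f+1 distinct nodes includes at least one honest node, so it cannot be pure fabrication, though it may still be stale):

```go
package main

import "fmt"

// latestConfig returns a config value once f+1 distinct nodes report it.
// responses maps node id -> reported config (e.g. a hash of the config).
func latestConfig(responses map[string]string, f int) (string, bool) {
	counts := map[string]int{}
	for _, cfg := range responses {
		counts[cfg]++
		if counts[cfg] >= f+1 {
			return cfg, true
		}
	}
	return "", false
}

func main() {
	resp := map[string]string{
		"node0": "cfg-v3",
		"node1": "cfg-v3",
		"node2": "cfg-v2", // stale or lying replica, never reaches f+1
		"node3": "cfg-v3",
	}
	cfg, ok := latestConfig(resp, 1) // f = 1
	fmt.Println(cfg, ok)             // cfg-v3 true
}
```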

simon
2016-09-19 15:36
all this complication would go away if we co-located consensus and committer, and just maintained one single blockchain

simon
2016-09-19 15:36
there wouldn't be 3 copies of the data

simon
2016-09-19 15:36
nor pruning

jyellick
2016-09-19 15:36
Why would pruning go away?

simon
2016-09-19 15:36
because you always maintain the one blockchain

simon
2016-09-19 15:37
or your history is lost

simon
2016-09-19 15:37
for all kinds of regulatory reasons that's not allowed anyways

jyellick
2016-09-19 15:37
I think there will likely still be some notion of pruning/archiving

simon
2016-09-19 15:39
but since that design won't happen anyways, we don't have to talk about it

jyellick
2016-09-19 15:44
One solution would be if we have some notion of encoding a more total policy on the chain, something like; "This is the set of valid public keys for signing, we require signatures for k of them, in order to validate this chain, you should not prune this block until a new config is written, which will be in at most L blocks.", but I'm not certain I like this. Sticking everything on one chain drastically simplifies things by not having to coordinate between chain state and config, but makes supporting pruning very tricky. Splitting them into two chains of course has the exact opposite problem.

simon
2016-09-19 15:45
then let's not prune

jyellick
2016-09-19 15:45
That is what I was about to say "Maybe we say screw pruning"

simon
2016-09-19 15:46
we should advise that the design, as it is, is space-inefficient and that pruning is very difficult

simon
2016-09-19 15:46
maybe you can bring that up during the meetings this week

jyellick
2016-09-19 15:46
I know it is on the agenda for later, I'd like to know how the validated ledger side sees pruning working @dave.enyeart

simon
2016-09-19 15:46
i won't be able to join on account of sickness

jyellick
2016-09-19 15:47
Yes, sorry you are still not feeling well, hope you get better soon

dave.enyeart
2016-09-19 15:47
has joined #fabric-consensus-dev

simon
2016-09-19 15:47
so do i - i keep getting sick on the weekend, which is very frustrating

simon
2016-09-19 15:47
yea, how would the validated ledger be pruned?

simon
2016-09-19 15:48
how do you build your world state?

simon
2016-09-19 15:48
you need a full copy

jyellick
2016-09-19 15:49
Yes, I feel like maybe some sort of special transaction, possibly specifying the set of data for archiving, but it seems hard, much harder even than in the sbft config case

simon
2016-09-19 15:50
bitcoin doesn't do that

simon
2016-09-19 15:50
i don't think any ad hoc solution will work out right

simon
2016-09-19 15:51
well anyways, what's on the agenda to hack for me?

simon
2016-09-19 15:52
i thought about implementing the application state (chain) interface, and then go to state transfer

simon
2016-09-19 15:53
so the `Deliver` API will have to change a bit

simon
2016-09-19 15:53
include a sequence number, and signatures

simon
2016-09-19 15:53
maybe deliver a block directly?

simon
2016-09-19 15:54
in case we decided on that format

simon
2016-09-19 15:54
but how does the consensus config fit in there

simon
2016-09-19 15:54
probably deliver from sbft won't provide a block

simon
2016-09-19 15:54
that's what the application would do

jyellick
2016-09-19 15:55
Bitcoin has no real remedy to unbounded growth, right? Presumably in 500 years someone will reference one of the early bitcoins that hasn't been touched and everyone will need the beginning of the chain?

simon
2016-09-19 15:55
yes

simon
2016-09-19 15:55
i think they say that storage will grow faster

simon
2016-09-19 15:55
anyways, does that seem like a reasonable thing to do next?

jyellick
2016-09-19 15:57
I was hoping to get around to hacking on the state transfer stuff, but with everyone visiting this week, that seems like an unlikely dream

simon
2016-09-19 15:57
yea

jyellick
2016-09-19 15:57
I don't think I saw any comments from you on the rawledger interface?

simon
2016-09-19 15:57
you mean the block interface?


simon
2016-09-19 15:58
ok i gotta run

simon
2016-09-19 15:58
say hi to everybody

jyellick
2016-09-19 15:59
Bye Simon, get well soon

simon
2016-09-20 10:51
ha, that sped up cycle time

simon
2016-09-20 10:52
if you only look at f+1 checkpoints, and check their signatures only once

simon
2016-09-20 10:52
22ms cycle time for an 80 node network

simon
2016-09-20 11:06
i think that's quite acceptable for a general purpose simple implementation
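
The optimization can be sketched like this (illustrative names; `verify` stands in for the `CheckSig` call from the `System` interface): buffer incoming checkpoints and pay for signature checks only once f+1 replicas agree on the same seqno and digest, instead of verifying every message on arrival:

```go
package main

import "fmt"

type checkpoint struct {
	Seq    uint64
	Digest string
	Src    uint64
}

// stableCheckpoint collects checkpoint messages and, once f+1 agree on the
// same (seq, digest), verifies only those f+1 signatures.
func stableCheckpoint(msgs []checkpoint, f int, verify func(checkpoint) bool) (string, bool) {
	byKey := map[string][]checkpoint{}
	for _, m := range msgs {
		key := fmt.Sprintf("%d/%s", m.Seq, m.Digest)
		byKey[key] = append(byKey[key], m)
		if len(byKey[key]) == f+1 {
			// Only now check signatures, and only on these f+1 messages.
			for _, c := range byKey[key] {
				if !verify(c) {
					return "", false
				}
			}
			return m.Digest, true
		}
	}
	return "", false
}

func main() {
	verified := 0
	verify := func(checkpoint) bool { verified++; return true }
	msgs := []checkpoint{
		{Seq: 10, Digest: "abc", Src: 0},
		{Seq: 10, Digest: "abc", Src: 1},
		{Seq: 10, Digest: "xyz", Src: 2}, // disagreeing replica, never verified
	}
	d, ok := stableCheckpoint(msgs, 1, verify)
	fmt.Println(d, ok, verified) // abc true 2
}
```

With 80 nodes that drops the per-cycle verification count from O(n) checkpoint signatures to f+1, which is consistent with the 640ms-to-22ms improvement reported above.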

garisingh
2016-09-20 11:59
feeling better @simon?

simon
2016-09-20 12:03
slowly

simon
2016-09-20 12:03
not recovered by a long shot

simon
2016-09-20 12:03
but hacking some code

simon
2016-09-20 12:03
meaning i'm doing good enough

simon
2016-09-20 12:04
@garisingh did you see my discussion with @donovanhide yesterday?

garisingh
2016-09-20 12:05
hacking code better than hacking lungs :wink:

garisingh
2016-09-20 12:05
let me look back at that exchange

donovanhide
2016-09-20 12:06
@simon @garisingh Here if it is a good time to discuss :slightly_smiling_face:

garisingh
2016-09-20 12:07
@donovanhide - are you stalking us ? :wink:

donovanhide
2016-09-20 12:07
I’m just omnipresent :slightly_smiling_face: Plus slack makes a noise when someone says my name :slightly_smiling_face:

garisingh
2016-09-20 12:08
I think this actually made it into Go 1.6 - they finally got over licensing and added in some nice assembly for this

donovanhide
2016-09-20 12:08
Yep, Intel and Cloudflare started throwing some amazing resources at Go!

donovanhide
2016-09-20 12:09
IBM too :slightly_smiling_face:

donovanhide
2016-09-20 12:21
@garisingh Not sure how far you might have scrolled back, but the main issue discussed was how well the next design for consensus could deal with highly contended resources, such as an index for an orderbook.

donovanhide
2016-09-20 12:25
Oh, I just scrolled back as well. Looks like Slack might have removed a substantial amount of the conversation :disappointed:

donovanhide
2016-09-20 12:26
Maybe you need to get the paid-for Slack..

donovanhide
2016-09-20 12:32
Strange, the history comes back when I reload Slack… Who knows :slightly_smiling_face:

garisingh
2016-09-20 13:08
I was able to read the whole history - I see what you are saying. I gotta run for a bit (actually working out some of these things), but I'll get back to you. There are actually some things that we have thought about in terms of plug points and I also think that we can look at other state machine models in addition to the MVCC default model

garisingh
2016-09-20 13:09
all good points and worth discussing. I have some ideas on how to deal with some of the things you brought up

donovanhide
2016-09-20 13:36
Great, thanks for taking the time to look!

hhadass
2016-09-21 21:09
has joined #fabric-consensus-dev

simon
2016-09-22 11:41
so i'm moving to state transfer, and i need to define an interface between sbft and system

simon
2016-09-22 11:42
`GetBatch(uint64) (*Batch, error)` seems like the first idea

simon
2016-09-22 11:42
some of these batches don't exist

simon
2016-09-22 11:42
because of null requests

simon
2016-09-22 11:42
so hm
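
One way to shape that interface given the gaps (a sketch, names illustrative): have the store return a sentinel error for seqnos that never produced a batch, so state transfer can skip null-request slots without treating them as failures:

```go
package main

import (
	"errors"
	"fmt"
)

type Batch struct {
	Seq      uint64
	Payloads [][]byte
}

// errNoBatch signals a gap: the seqno was a null request (or is unknown),
// which is expected and not fatal.
var errNoBatch = errors.New("no batch at this seqno")

type batchStore struct {
	batches map[uint64]*Batch
}

func (s *batchStore) GetBatch(seq uint64) (*Batch, error) {
	b, ok := s.batches[seq]
	if !ok {
		return nil, errNoBatch
	}
	return b, nil
}

func main() {
	s := &batchStore{batches: map[uint64]*Batch{
		1: {Seq: 1}, 3: {Seq: 3}, // seqno 2 was a null request
	}}
	for seq := uint64(1); seq <= 3; seq++ {
		if _, err := s.GetBatch(seq); err != nil {
			fmt.Println(seq, "skipped:", err)
			continue
		}
		fmt.Println(seq, "ok")
	}
}
```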

jyellick
2016-09-22 13:07
Also presumably pruning

simon
2016-09-22 13:08
no pruning

simon
2016-09-22 13:08
i'm not going to implement pruning

jyellick
2016-09-22 13:10
I am hoping we can finally nail down the actual block format today, I'm not certain how we can hope to support configuration on the chain until the block format is finalized

simon
2016-09-22 13:10
i still don't quite know how config changes should run through

jyellick
2016-09-22 13:11
We discussed this some with Marko

jyellick
2016-09-22 13:12
The most obvious place is to do so during a view change, but I am wary of adding complexity to that process

simon
2016-09-22 13:12
i mean, how does it get communicated first?

jyellick
2016-09-22 13:12
(Especially considering the scenario where f=1, n=6 -> f=2, n=7)

jyellick
2016-09-22 13:12
Presumably a special transaction? Piggy backed on the normal Request mechanism

simon
2016-09-22 13:13
so it is part of a batch?

jyellick
2016-09-22 13:13
I would think so

tuand
2016-09-22 13:13
the bluemix and z admin folks are looking for ways to monitor that a v0.5 or v0.6 pbft replica is not participating ? they have machine status (how long up/down) and chain height ... can't really say check for replica chain height since there's always someone lagging ... maybe monitor input rate versus block chain height increase ?

simon
2016-09-22 13:13
jyellick: so it gets delivered to the committers as well

jyellick
2016-09-22 13:14
simon: Yes. The committers need to know what the validity conditions are

jyellick
2016-09-22 13:14
Whether they are validating via signatures or via 2f+1 connections

jyellick
2016-09-22 13:14
They need to know that the old threshold is no longer valid for validation

simon
2016-09-22 13:14
so this needs to be a defined format that is shared between committers and consensus

jyellick
2016-09-22 13:14
Yes, I believe so

simon
2016-09-22 13:14
ok

jyellick
2016-09-22 13:16
I really like the idea of self validating chains, keeping the policy on the chain. My only concern with this is if we must support many many chains, that this configuration could end up duplicated many many times

jyellick
2016-09-22 13:17
The alternative would be to keep some sort of side chain, and inject a special "The configuration has changed" event into all the chains to go look up the new configuration, but I'm not sure the complexity is worth the gain.

jyellick
2016-09-22 13:18
Things seem to be in flux as to whether we will be supporting 'channels' where the transactions for the subchain go through ordering, or whether we will simply provide an interface to allow the easy construction of side chains (point to point n out of n agreement, then pushing a transaction+signatures+salt hash onto the main chain when they want finality)

jyellick
2016-09-22 13:20
Marko was pushing the latter method, and I think it certainly makes the implementation more straightforward and probably more scalable

jyellick
2016-09-22 13:20
In pretty much every other blockchain system, this is the solution for high throughput confidential stuff

jyellick
2016-09-22 13:26
@tuand I'm not sure what the difference is between the machine status chain height and the replica chain height. If the chain height matches, and queries are not rejected due to 'inconsistent state', then the replica should be up to date

tuand
2016-09-22 13:27
sorry, not being clear, they have 2 bits of info ... current machine status and chain height

tuand
2016-09-22 13:29
i think the worry is ... given we see a replica lagging, is that condition indicative of some issue ?

jyellick
2016-09-22 13:37
PBFT allows (and arguably encourages) f nodes not to participate in the network

jyellick
2016-09-22 13:38
So, lagging isn't really indicative of anything going wrong

tuand
2016-09-22 13:38
exactly

jyellick
2016-09-22 13:39
However, if the lagging continues under low load, then it seems possible that there is something wrong with timeouts for state transfer

jyellick
2016-09-22 13:39
With the new default logging in place, this should be detectable (or, anytime statetransfer logs at warning or better)

tuand
2016-09-22 13:40
right ... so they might want to monitor logs in real time rather than monitor chain height

jyellick
2016-09-22 13:43
Statetransfer warnings would probably be the best indication that something is going wrong

jyellick
2016-09-22 13:43
(They are just warnings, and do not necessarily indicate that the system will never recover)

tuand
2016-09-22 13:45
agreed ... just to flag someone to take a more in-depth look

tuand
2016-09-22 13:46
btw, did the discussion get to detection of byzantine nodes yesterday ?

yacovm
2016-09-22 13:59
hey @channel, I understood that in the new architecture, the peers would gossip among themselves in order to create the same validated blocks (out of the raw blocks). 1) Is that correct? 2) How is this going to be done? How will the peers know who to chat with and how many, when to create the validated blocks, who to tell about this, etc? 3) Is this going to be an autonomous effort or a group effort? because... a group effort sounds a bit like a consensus type problem

jyellick
2016-09-22 13:59
@tuand It did, with the simple answer of "We can target specific scenarios, but a general solution is pretty much impossible"

jyellick
2016-09-22 14:03
@yacovm 1) The creation of the same block should be possible purely by having the same raw ledger, the gossip is confirmation thereof, depending on policy. 2) The (very loose) idea is that there will be a global peer membership list, hopefully embedded in the chain, but maybe not, and then each peer will pick a subset of all peers to gossip with, randomly selecting at each round to probabilistically beat byzantine nodes 3) I'm not sure what this means. Because the block generation can actually be done via the raw ledger, and it is only validation/confirmation of the correctness, we get to eliminate the requirement of order, so it is at least a drastically simplified consensus problem

yacovm
2016-09-22 14:09
Thanks, I understand more clearly now @jyellick . 1) OK 2) our gossip component can provide membership information, maybe we could use that? I think it's better than storing membership in the chain because membership is something that changes pretty dynamically and some nodes are alive now and offline in the next hour, etc. 3) your answer to (1) answers this, it's autonomous because the validated block is only a function of the raw block, and input of peers isn't needed IIUC. I was concerned that it was something like: "peer A tries to suggest to peers B, C, and D a certain validated block, and they can accept or reject, etc. etc. until they all reach the same result"

cca
2016-09-22 15:08
@yacovm - re 3) the output from the "consensus service" defines the tip of a hash chain; a peer needs to receive this directly from the consensus service, or the consensus service has signed this. afterwards the construction of the chain is only about getting the right blocks, because the blocks form a hashchain, and this is unique given the tip. this holds for raw blocks as well as for validated blocks. what you state as your concern should not arise.

simon
2016-09-22 15:16
@vukolic told me that there will not be gossip?

yacovm
2016-09-22 15:24
@cca wait, I thought it also feeds the blocks to the peers doesn't it?

vukolic
2016-09-22 15:44
No gossip for new blocks

vukolic
2016-09-22 15:45
For filling the gap we may still want to use p2p communication

vukolic
2016-09-22 15:45
If you call that gossip - then it is still a possible option

vukolic
2016-09-22 15:45
To complement fill-in-the-gap from consensus

cca
2016-09-22 15:55
don't understand "no gossip for new blocks" - how are these blocks disseminated? shouldn't there be a mechanism to inform 1000s of peers what the 10s (max) of consenters decided? i thought this was the goal of having that

yacovm
2016-09-22 16:30
yeah @vukolic , how are the blocks going to get from the consensus to the peers if not via our gossip network?

vukolic
2016-09-22 17:02
in v1 via direct connection of peer to consensus

vukolic
2016-09-22 17:02
every peer

vukolic
2016-09-22 17:02
post v1 we may plug in gossip

vukolic
2016-09-22 17:03
(not my choice - I am just relaying)

vukolic
2016-09-22 17:04
I presume once gossip solution was ready we could try to bring it back for v1

vukolic
2016-09-22 17:04
but it is my own view

yacovm
2016-09-22 17:04
I hope there will be a view change then

matanyahu
2016-09-22 21:11
in the current version of Fabric, is it possible to dynamically add new peers (vps/nvps) into an already running network?

matanyahu
2016-09-22 21:12
from my understanding this is not possible right now when PBFT consensus is used

tuand
2016-09-22 21:13
for v0.5, v0.6 of hyperledger fabric, the number of peers is statically set at startup

yacovm
2016-09-22 21:13
I don't think so, there is that parameter, N, in the pbft config

yacovm
2016-09-22 21:14
in order for this to work, the file somehow needs to be reloaded and the PBFT module needs to be "refreshed"

matanyahu
2016-09-22 21:14
which equals to restarting the network

tuand
2016-09-22 21:15
for the next architecture, we've broken things out into endorser/orderer/commiter components

tuand
2016-09-22 21:16
so question on adding a peer becomes adding an endorser or adding a committer

matanyahu
2016-09-22 21:17
but I assume that the Fabric GA will be capable of dynamically extending network with new peers

matanyahu
2016-09-22 21:18
and now it's a question of editing yml files and restarting docker-compose

matanyahu
2016-09-22 21:18
right?

tuand
2016-09-22 21:20
can't tell you about fabric GA since I'm not privy to IBM's plans

matanyahu
2016-09-22 21:21
but that's the goal

tuand
2016-09-22 21:21
don't think restarting docker-compose is equivalent to dynamically adding peers

matanyahu
2016-09-22 21:22
redoing docker-compose will reload yml files which will effectively launch new peers if these files were edited beforehand. But this means the network will be interrupted

matanyahu
2016-09-22 21:22
that's my understanding of the current status

tuand
2016-09-22 21:24
there's discussion on bootstrapping a new endorser/committer so those can be added to a running network

tuand
2016-09-22 21:26
dynamically adding an orderer depends on the specific consensus protocol being used, how to notify the peers, when to checkpoint and so on ... longer term discussion I think

matanyahu
2016-09-22 21:27
so now it's safer to define the final number of peers before the network is deployed

matanyahu
2016-09-22 21:28
and then just assign a given peer to a new member of an existing business network which joined after the private blockchain service was already deployed

matanyahu
2016-09-22 21:29
if for business reasons a new member will accept joining the network on a condition that it will receive VP capabilities

matanyahu
2016-09-22 21:29
or any kind of capabilities that require direct participation in the maintenance of consensus

matanyahu
2016-09-22 21:29
a contractual question, not a technical one

smartyalgo
2016-09-22 21:32
has joined #fabric-consensus-dev

tuand
2016-09-22 21:32
i'll let others chime in on this

matanyahu
2016-09-22 21:33
thanks anyway :slightly_smiling_face:

matanyahu
2016-09-22 21:33
otherwise, what is the current optimal number of peers that can be running in a network without impacting network and pbft-based consensus performance?

tuand
2016-09-22 21:35
turns out that for the existing code, the bottleneck is starting/running the docker container for the chaincode

tuand
2016-09-22 21:36
more discussion over on #performance-benchmark ... most people are running 4 peers+membersrvc at this point but that's probably because we're all developers exploring how to do chaincode

matanyahu
2016-09-22 21:37
in Vukolic's paper "The Quest for Scalable Blockchain Fabric: Proof-of-Work vs. BFT Replication" it is claimed that >=20 nodes is optimal for BFT but this is not HL-specific.

matanyahu
2016-09-22 21:38
thanks, I will pop-in there

yacovm
2016-09-22 21:38
is there a proof why?

tuand
2016-09-22 21:41
@vukolic ^^^

yacovm
2016-09-22 21:50
i asked matanyahu since he's already here and i'm curious enough to ask, but too lazy to download the paper and read it.

matanyahu
2016-09-22 22:37
@yacovm : "However, having been invented in the context of replicating traditional applications, such as databases, for fault-tolerance, BFT protocols were never really tested thoroughly for their scalability beyond, say, n = 10 or n = 20 nodes, in particular in the light of the fairly modest performance targets of many blockchain applications."

matanyahu
2016-09-22 22:40
"As we have already discussed, the major challenge for BFT protocols that prevents their wider adoption in blockchain is their scalability in terms of the number of nodes. Stellar [44] is an ongoing effort aimed at removing unanimously accepted membership lists from BFT protocols, while maintaining the other BFT advantages over PoW. Other approaches target the BFT scalability without changing membership assumptions. These include optimistic BFT protocols [52, 3] which feature linear communication complexity in the “common case” and resort to expensive O(n^2) communication among nodes featured by classical protocols such as PBFT [10] only if the network and the process fault pattern are particularly unfavorable. However, even optimistic BFT have a resource and communication overhead when compared to crash-tolerant replication protocols (e.g., [37, 31, 50]), which are better proven in practice and may serve as a baseline for BFT."

matanyahu
2016-09-22 22:43
@matanyahu uploaded a file: https://hyperledgerproject.slack.com/files/matanyahu/F2EV9EWRE/selection_072.png and commented: Node Scalability - as per Vukolic's paper

simon
2016-09-23 07:56
where does it say that >= 20 is optimal?

simon
2016-09-23 11:35
should we expose null requests to the "application"?

simon
2016-09-23 11:36
i.e. should null requests be empty batches, or no batch at all

simon
2016-09-23 11:36
if they are no batch, we somehow need to persist all data related to it outside of the chain

simon
2016-09-23 11:37
basically a batch with signatures, just without hash chain

simon
2016-09-23 11:37
might as well just use an empty batch instead?
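The trade-off simon describes can be illustrated with a minimal sketch (the `Batch` type and its hash-chaining are hypothetical, not the actual fabric structures): if a null request is just an empty batch, it flows through the ordinary hash chain and needs no separate persistence path outside the chain.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// Batch is a hypothetical stand-in for a consensus batch: a previous-hash
// link plus zero or more request payloads. A "null request" is simply a
// Batch with an empty Requests slice.
type Batch struct {
	PrevHash [32]byte
	Requests [][]byte
}

// Hash chains the batch to its predecessor; an empty batch is hashed
// exactly like any other, so it still participates in the hash chain.
func (b Batch) Hash() [32]byte {
	h := sha256.New()
	h.Write(b.PrevHash[:])
	for _, r := range b.Requests {
		h.Write(r)
	}
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}

func main() {
	genesis := Batch{}
	null := Batch{PrevHash: genesis.Hash()} // null request: zero requests
	next := Batch{PrevHash: null.Hash(), Requests: [][]byte{[]byte("tx1")}}
	// The empty batch is chained exactly like any other batch.
	fmt.Println(next.PrevHash == null.Hash())
}
```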

vukolic
2016-09-23 11:50
@simon - what do we do for censorship of requests (i.e., request liveness) currently in sBFT?

simon
2016-09-23 12:15
nothing

simon
2016-09-23 12:15
we could hook up the request timer

simon
2016-09-23 12:15
but then we limit ourselves to one outstanding request at a time

vukolic
2016-09-23 12:18
I thought about this on the plane a lot

vukolic
2016-09-23 12:18
we need a mechanism that is essentially independent of a BFT/XFT protocol

vukolic
2016-09-23 12:20
for: 1) reliable broadcast of client's request, 2) liveness/termination, and possibly 3) elimination of (some) duplicate requests

vukolic
2016-09-23 12:20
I have an idea how to design such a thing - but let's discuss first in person it is easier

simon
2016-09-23 12:26
also filtering invalid requests

vukolic
2016-09-23 12:27
that is also needed but is separate from this

vukolic
2016-09-23 12:27
BTW - I spoke to folks from a company very interested in using HL

vukolic
2016-09-23 12:28
it's not that they don't want to know about invalid transactions that appear on RL

vukolic
2016-09-23 12:28
they MUST know about them

vukolic
2016-09-23 12:29
so for example they have a requirement that consensus does not "swallow" invalid tx

simon
2016-09-23 12:29
but only those that are signed?

vukolic
2016-09-23 12:29
sure - malformed requests could/should be dropped

vukolic
2016-09-23 12:29
but semantically invalid transactions - they need to know about

vukolic
2016-09-23 12:32
BTW - that mechanism for reliable/broadcast and request liveness should also incorporate flow control

vukolic
2016-09-23 12:32
and then

vukolic
2016-09-23 12:33
one can easily change protocols - but this thing would stay - hopefully irrespective of the protocol

simon
2016-09-23 12:35
what do you mean by flow control?

vukolic
2016-09-23 12:37
among other things - avoiding DoS from clients drowning the consenters/primary with requests

vukolic
2016-09-23 12:39
I wonder if this can even be a library that is called by consensus protocols

vukolic
2016-09-23 12:39
that would require more design...

simon
2016-09-23 12:44
yea, that would be nice

vukolic
2016-09-23 12:45
in that case a consensus protocol does not communicate with clients at all

vukolic
2016-09-23 12:45
but fetches requests from the flow control component

vukolic
2016-09-23 12:46
(obviously every replica runs locally the flow control component)

vukolic
2016-09-23 12:46
and on commit from the consensus protocol - there is an event to flow control component which does with committed requests what it needs to do

vukolic
2016-09-23 12:46
view change interaction is less obvious and maybe protocol dependent

vukolic
2016-09-23 12:47
but I'd like we eventually have such a component

vukolic
2016-09-23 12:49
and doing this for every protocol specifically is nonsense

vukolic
2016-09-23 12:49
it must be as generic as possible
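The protocol-agnostic component vukolic describes could look roughly like the following Go sketch. Everything here is hypothetical (the `RequestStore` interface, the `memStore` implementation, and its names are illustrations, not a proposed fabric API): clients submit through it, the consensus protocol fetches pending requests from it and reports commits back, and it handles deduplication plus a crude form of flow control.

```go
package main

import "fmt"

// Request is an opaque client request blob plus an id used for deduplication.
type Request struct {
	ID      string
	Payload []byte
}

// RequestStore sketches the generic component: the consensus protocol never
// talks to clients directly, it fetches requests from here and signals commits.
type RequestStore interface {
	Submit(r Request) bool   // from clients; false if duplicate or over quota
	Fetch(max int) []Request // called by the consensus protocol
	Committed(ids []string)  // commit event from the consensus protocol
}

// memStore is a minimal single-node, in-memory implementation; a real one
// would also need reliable broadcast and persistence.
type memStore struct {
	pending []Request
	seen    map[string]bool
	limit   int // crude flow control: cap on outstanding requests
}

func newMemStore(limit int) *memStore {
	return &memStore{seen: map[string]bool{}, limit: limit}
}

func (s *memStore) Submit(r Request) bool {
	if s.seen[r.ID] || len(s.pending) >= s.limit {
		return false // duplicate, or client is drowning us: drop it
	}
	s.seen[r.ID] = true
	s.pending = append(s.pending, r)
	return true
}

func (s *memStore) Fetch(max int) []Request {
	if max > len(s.pending) {
		max = len(s.pending)
	}
	return s.pending[:max]
}

func (s *memStore) Committed(ids []string) {
	drop := map[string]bool{}
	for _, id := range ids {
		drop[id] = true
	}
	var rest []Request
	for _, r := range s.pending {
		if !drop[r.ID] {
			rest = append(rest, r)
		}
	}
	s.pending = rest
}

func main() {
	s := newMemStore(2)
	fmt.Println(s.Submit(Request{ID: "a"})) // accepted
	fmt.Println(s.Submit(Request{ID: "a"})) // rejected: duplicate
	fmt.Println(s.Submit(Request{ID: "c"})) // depends on remaining capacity
	s.Committed([]string{"a"})
	fmt.Println(len(s.Fetch(10)))
}
```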

simon
2016-09-23 12:49
yea

simon
2016-09-23 12:50
how do you intend to flow control clients?

simon
2016-09-23 12:50
it needs to be deterministic

vukolic
2016-09-23 12:50
let's discuss over whiteboard and then later forward here if meaningful

simon
2016-09-23 12:52
okay

tuand
2016-09-23 13:13
@vukolic @jyellick @kostas @sanchezl are you guys in building 500 today ? should we try for a consensus face to face before marko has to fly away ?

vukolic
2016-09-23 13:15
marko is already in zurich...

vukolic
2016-09-23 13:15
had to fly away yesterday evening

tuand
2016-09-23 13:16
what !? didn't even have a chance to say bye ... oh well, back to virtual mode

simon
2016-09-23 13:16
hm, how do i detect that i am out of date?

simon
2016-09-23 13:16
i could collect all checkpoints

simon
2016-09-23 13:20
the problem is that i might have N-1 checkpoints for different seqnos from N-1 replicas

simon
2016-09-23 13:21
so i think the better solution is:

simon
2016-09-23 13:21
if i receive a seqno i think is wrong, i drop the connection

simon
2016-09-23 13:22
when i reconnect, there is a handshake, and the other side gives me a set of signatures for its last checkpointed batch

simon
2016-09-23 13:23
then i do a state transfer

simon
2016-09-23 13:23
and i continue

vukolic
2016-09-23 13:25
I am not following

vukolic
2016-09-23 13:27
in general (not PBFT specific) one could always take f+1st highest (per block height) checkpoint message

vukolic
2016-09-23 13:27
and figure out one is late

vukolic
2016-09-23 13:27
this means at least one correct replica has a commit at that height

simon
2016-09-23 13:28
yes

vukolic
2016-09-23 13:28
if you are (sufficiently) behind

vukolic
2016-09-23 13:28
you can start looking for state transfer

vukolic
2016-09-23 13:28
other policies are imaginable...

vukolic
2016-09-23 13:29
but this is one example
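The f+1st-highest-checkpoint rule above can be sketched in a few lines of Go (names like `behind` and the `gap` threshold are illustrative assumptions, not fabric code): sort the checkpoint heights reported by other replicas in descending order; the f+1st highest is vouched for by at least one correct replica, so a commit exists at that height, and if it is sufficiently ahead of our own height we should start state transfer.

```go
package main

import (
	"fmt"
	"sort"
)

// behind reports whether this replica is sufficiently behind to warrant
// state transfer, given checkpoint heights reported by other replicas, the
// fault bound f, our own height, and a tolerance gap.
func behind(reported []uint64, f int, mine, gap uint64) bool {
	h := append([]uint64(nil), reported...)
	sort.Slice(h, func(i, j int) bool { return h[i] > h[j] })
	if len(h) < f+1 {
		return false // not enough reports to conclude anything
	}
	// f+1st highest: at least one correct replica has a commit at this height,
	// so up to f Byzantine replicas cannot inflate it.
	trusted := h[f]
	return trusted > mine+gap
}

func main() {
	// n=4, f=1: two replicas claim height 100, one (possibly lying) claims 900.
	// The f+1st highest is 100, so the lie is ignored.
	fmt.Println(behind([]uint64{900, 100, 100}, 1, 10, 50)) // we are behind
	fmt.Println(behind([]uint64{900, 100, 100}, 1, 95, 50)) // within tolerance
}
```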

matanyahu
2016-09-23 13:33
@simon : this is what I assumed from the paper. I am simply curious, where are we with regards to limits of PBFT-driven network scalability which would not affect residual performance in terms of tx/s.

simon
2016-09-23 13:33
the fewer replicas, the higher the performance

matanyahu
2016-09-23 13:42
obviously but this is a technical explanation. From business perspective, I can imagine that over time founding members of the business network allow newcomers to join and for different reasons, these would like to become qualified members (NVPs/VPs). If a functional requirement of members is to maintain 100s tx/s then at some point this will hit a wall due to a growing number of replicas. Therefore, an architectural decision would assume that new participants to the network would not be qualified to become full members but rather consume blockchain indirectly, through APIs.

simon
2016-09-23 13:47
yes

yacovm
2016-09-23 15:57
anyone here? I have a question regarding the block commit validation policy

yacovm
2016-09-23 15:57
(next arch, of course)

ghaskins
2016-09-23 15:58
one of the things that are interesting here and in other distributed networks (thinking dynamo) is they often have the desirable performance properties that they converge on the fastest nodes in the network (rather than the slowest)

yacovm
2016-09-23 15:59
isn't dynamo using sharding/consistent hashing?

ghaskins
2016-09-23 15:59
so, you absolutely want to use discretion in the admittance policy, but the good news is a small degree of mistakes can be tolerated

ghaskins
2016-09-23 15:59
@yacovm i am not referring to that aspect of the system

ghaskins
2016-09-23 16:00
purely that something like PBFT or dynamo protocol tend to operate at the speed of the faster portion of the network that meets minimum quorum

ghaskins
2016-09-23 16:00
rather than the weakest link

yacovm
2016-09-23 16:00
you mean the speed of progress is the speed of the fastest write-quorum available

ghaskins
2016-09-23 16:01
(or read, in the case of dynamo at least)

ghaskins
2016-09-23 16:01
but yes

ghaskins
2016-09-23 16:01
though I suppose that is likely also true elsewhere

ghaskins
2016-09-23 16:03
what i mean is the admittance risk is reduced, by virtue of the fact that admitting one node (or a small number of nodes) doesn’t necessarily expose the network to an unanticipated reduction in throughput per se

ghaskins
2016-09-23 16:03
the slow nodes will be the ones disregarded

ghaskins
2016-09-23 16:03
doesn’t mean you shouldn’t be concerned, monitor, and/or enforce

abhishekseth
2016-09-26 07:31
has joined #fabric-consensus-dev

soldat
2016-09-26 14:41
has joined #fabric-consensus-dev

g_alexander
2016-09-27 09:13
has joined #fabric-consensus-dev

niubwang
2016-09-27 12:54
has joined #fabric-consensus-dev

niubwang
2016-09-27 12:54
hi guys, when i add a new validator peer (using PBFT), the new peer can't sync blocks from the others, who can help me?

simon
2016-09-27 12:54
hi

simon
2016-09-27 12:55
you cannot do that dynamically

simon
2016-09-27 12:55
you will have to shut down the whole network, configure all for the new number of validators, then restart

simon
2016-09-27 12:56
then use the network as usual, and the new peer should sync up eventually

niubwang
2016-09-27 13:03
@simon hi, is this by design? i want to dynamically add new peers without shutting down the whole network

simon
2016-09-27 13:04
not implemented at the moment

simon
2016-09-27 13:04
will come in 1.0

simon
2016-09-27 13:04
i think

garisingh
2016-09-27 13:04
eventually in 1.0 - yes

niubwang
2016-09-27 13:07
i want to use the fabric for business, does that mean i can't use it now? when will 1.0 come?

simon
2016-09-27 13:12
you can, you just can't change the set of validators without stopping the network for a moment

niubwang
2016-09-27 13:19
@simon for example, if i run 5 validators, and one of them is shut down for a moment, can it sync blocks from the others once it is restarted?

simon
2016-09-27 13:19
if it is just shut down and restarted, yes

simon
2016-09-27 13:20
but 5 validators is usually not as good as 4
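simon's point follows from the PBFT resilience bound n >= 3f+1: the tolerated fault count is f = floor((n-1)/3), so 4 and 5 validators both tolerate only one fault, and the fifth replica adds communication overhead without buying extra resilience. A minimal illustration (the function name is ours, not fabric's):

```go
package main

import "fmt"

// faultTolerance returns the number of Byzantine faults f that a PBFT
// network of n replicas can tolerate: the largest f with n >= 3f+1.
func faultTolerance(n int) int {
	return (n - 1) / 3
}

func main() {
	for _, n := range []int{4, 5, 6, 7} {
		fmt.Printf("n=%d tolerates f=%d\n", n, faultTolerance(n))
	}
	// n=4, 5 and 6 all tolerate f=1; only at n=7 does f increase to 2.
}
```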

niubwang
2016-09-27 13:31
@simon what i want to do, is i want to run some validators in my server, and i want the other guest users to run some validators, so they can get the block data too. as you mean, now i can't add validators for new guest users

simon
2016-09-27 13:32
i think you want non-validator peers

simon
2016-09-27 13:32
which never really worked, i think

niubwang
2016-09-27 13:34
i think non-validator peers can't sync blocks right now either

niubwang
2016-09-27 13:39
i want the guest user can save the block data locally

claytonsims
2016-09-27 19:11
has joined #fabric-consensus-dev

jyellick
2016-09-27 21:00
@muralisr @dave.enyeart I've created an epic around supporting embedding orderer configuration in the raw ledger https://jira.hyperledger.org/browse/FAB-495 if you would like to take a look

jyellick
2016-09-27 21:02
@elli You might also have an opinion

elli
2016-09-27 21:02
has joined #fabric-consensus-dev

dave.enyeart
2016-09-27 21:06
Jason, these would be transactions on the ‘main’ ledger right? As opposed to a side ledger for orderer config

jyellick
2016-09-27 21:07
Correct

jyellick
2016-09-27 21:07
There could be a side ledger to help orchestrate for an ordering network, but this needs to be per chain I believe

dave.enyeart
2016-09-27 21:07
and i assume ordering service would need to read the ledger state right

jyellick
2016-09-27 21:10
Well, that is the question

jyellick
2016-09-27 21:11
You'll notice in FAB-499 I was intentionally non-specific

jyellick
2016-09-27 21:11
If we pull in the ledger, and do real MVCC parsing, then we'll need to bring some amount of stuff along

jyellick
2016-09-27 21:13
@simon I'd especially like your input in regards to FAB-499. If the orderer is going to need to understand how to apply a fabric type transaction (or least pieces of it), what is your opinion of simply pulling in the peer support for it, vs trying to build something lighter and more ad-hoc? Basically, the idea would be that the ordering service would be a stripped down version of the 0.5 framework (with determinism), where the only transaction type which actually executes is a reconfiguration transaction, but we re-use the peer code for doing validation of endorsement, updating the database, MVCC etc. This seems a bit heavy handed and maybe more than we really want the ordering service to do. On the other hand, it seems like it could really lend itself well to code re-use, so that we are not solving the same problems twice.

dave.enyeart
2016-09-27 21:14
With a naive approach we’d push the entire state database to raw ledger side… we hadn’t planned on that previously, and i don’t think you want the entire state database actually. Maybe a subledger with state database for orderer config that everybody shares? And for the main system ledger, ordering service doesn’t keep the state database?

jyellick
2016-09-27 21:14
@kostas @tuand @sanchezl Your opinions also welcome

jyellick
2016-09-27 21:14
I would say absolutely no to pushing the entire state database to the raw ledger side

jyellick
2016-09-27 21:14
If by that you mean to the orderer

jyellick
2016-09-27 21:15
We would just want the ledger for the orderer configuration system chaincode

dave.enyeart
2016-09-27 21:15
right, that would be the naive approach, which i agree we wouldn’t want

jyellick
2016-09-27 21:15
Which seems easy enough if we only 'execute' the orderer configuration transactions

garisingh
2016-09-27 21:21
@jyellick - while I think on one side you are saying reuse some "fabric" mechanisms for the PBFT ordering service in order to provide the ability to update the "config" (membership, etc) of the PBFT ordering service, it also sounds like you are proposing a very tight coupling of the ordering service and the fabric

garisingh
2016-09-27 21:22
which is not what we want IMHO. But hey - what do I know?

garisingh
2016-09-27 21:23
To me, I should be able to use any ordering service without consuming it from the fabric

jyellick
2016-09-27 21:23
@garisingh The fabric and the ordering service need to agree on the orderer configuration. And, in order to synchronize it with the chain, it most likely needs to be embedded in the chain

jyellick
2016-09-27 21:24
My initial thought was, embed some binary blob, which we describe the format of, and done

jyellick
2016-09-27 21:24
But as we talked about it, things like "Well, we'll need to validate that the right signatures are here" and so forth, it sounded a lot like endorsement

garisingh
2016-09-27 21:25
So I am not saying that an implementation of the ordering service could not choose to use pieces of the fabric, but what I am saying is that I should be able to use that ordering service from something other than the fabric

jyellick
2016-09-27 21:26
Ah, yes, absolutely. I am not suggesting you could not do that

jyellick
2016-09-27 21:26
Maybe, the fabric transaction format is too complicated to reasonably expect a non-fabric application to support

jyellick
2016-09-27 21:27
As I was trying to come up with a way of generating this fabric-type-transaction, I realized it was going to be quite a pain, which why I started thinking about pulling the common code bits in

jyellick
2016-09-27 21:28
But ultimately, the ordering service takes in blobs, and spits out batches/blocks of blobs

jyellick
2016-09-27 21:28
Just in the case that that blob happened to be the special configuration transaction type blob, it would do some other stuff

jyellick
2016-09-27 21:30
(I'm also a little concerned with the overhead of inspecting every blob to see if it's a special type. However, since we're already going to have to be hashing and checking signatures, it seems like not a lot of additional overhead)

garisingh
2016-09-27 21:31
well you could avoid that by having a special "channel" for config transactions

garisingh
2016-09-27 21:31
that's what lots of messaging servers do

garisingh
2016-09-27 21:31
they have "system" topics and queues

jyellick
2016-09-27 21:31
Right

jyellick
2016-09-27 21:31
The question is synchronizing that configuration to the other chains

wil.pannell
2016-09-27 21:31
has joined #fabric-consensus-dev

jyellick
2016-09-27 21:32
How do you know that at block 30 that you should be looking for a different set of signatures?

jyellick
2016-09-27 21:32
You could embed which block the configuration change applies to in the special system chain, but how do you know you have an up to date enough copy of the system chain?

jyellick
2016-09-27 21:33
Having it within the same message stream solves a lot of problems, though I admit, it creates some too. I'm certainly open to other approaches.

garisingh
2016-09-27 21:33
and doesn't the problem become worse with multiple channels? we would be guaranteeing order per channel but not total order across channels

jyellick
2016-09-27 21:34
Yes, every channel would need to get a 'reconfiguration transaction' when the configuration changed

jyellick
2016-09-27 21:34
I would expect for reconfiguration events to be pretty rare, but it's a concern

garisingh
2016-09-27 21:34
so for example, if for PBFT we decide that you need to receive f+1 of the same message before "delivering" to the raw ledger and then at some point you increase the number of ordering nodes, how do you handle that?

garisingh
2016-09-27 21:35
okay - so you insert a tx in the stream of every channel?

jyellick
2016-09-27 21:35
Right

jyellick
2016-09-27 21:36
I don't see any way around that, especially if you decide you want to scale your ordering service. You might have a single service hosting 10k channels, and you decide you want to move half of the channels to a new set of nodes. Not sure how else you do it.

garisingh
2016-09-27 21:36
and I guess it does not have to be "atomic" across all channels - it just has to make it into all channels

jyellick
2016-09-27 21:36
Right

garisingh
2016-09-27 21:39
but this makes the consumer side logic a bit complex in some cases - for example I can listen for multiple channels on the same physical connection, but now my handlers for each channel might have a different policy for a short period of time. On the other hand, from the consuming / committing peer, they really should not be aware of any of this if we handle it in the "ordering service client" piece

jyellick
2016-09-27 21:39
Right. And, I would argue, absolutely your ordering configuration might deliberately be different for different channels.

garisingh
2016-09-27 21:39
and I guess these config transactions would be a block with a single tran?

jyellick
2016-09-27 21:40
You could implement them that way, though I see no immediate harm in including them in a batch with other trans

garisingh
2016-09-27 21:41
true enough. although kind of nice to treat them a bit special

jyellick
2016-09-27 21:41
Yes, making them easy to spot is a plus

jyellick
2016-09-27 21:44
At the end of the day, the fabric and the ordering service both need to have the same view of "who's ordering" for the same blocks. How we synchronize this data is up in the air to me. I think it make sense to send it across as part of the chain. It looks a whole lot like a fabric transaction, because we will want to ultimately validate it with signatures etc., but really, the fabric transaction format is probably more complex than necessary for it. I'd also be open to making a new transaction type (the data structure already supports this) which is much simpler just for config. But, then we have to re-invent some stuff which is already handled by the existing transaction. I'd really love to be persuaded one way or the other.

jyellick
2016-09-27 21:45
In discussions last Thursday I think @jeffgarratt was a big proponent of re-using the fabric transaction, maybe he wants to voice his opinion here too.

jeffgarratt
2016-09-27 22:12
@jyellick @garisingh yes, I think we can use transaction now as I would in the future like the option to have the consensus service request a change through endorsement of the associated channel

cca
2016-09-28 07:30
@jyellick , this "orderer" component/service, is this the same as consensus service? ie., those nodes that run a distributed atomic broadcast / consensus protocol?

cca
2016-09-28 07:32
if yes: every implementation of this will have to come with a "stub" library to be run by the other peers that receive the output from the consensus service; this component is specific to the choice (whether solo or pbft, say) and will know how to parse these special tx that contain re-configuration info; so that it can update its list of N signing node keys, say.

simon
2016-09-28 09:33
i think that whole new design is way too complicated and not layered properly anymore

simon
2016-09-28 09:33
it started out as a quest to layer components and isolate them

simon
2016-09-28 09:33
and now, during implementation phase, we're again adding requirements last minute

simon
2016-09-28 09:35
channels only makes sense as a (de)multiplex part right at ordering ingress/egress. everything inside should treat all of these as blobs, no matter which channel they came in.

simon
2016-09-28 09:38
given that we have all these requirements replicating the work of a fabric peer, i'd say throw away the design and integrate the consensus service back into the fabric peer

simon
2016-09-28 09:38
this time with better abstraction

simon
2016-09-28 09:40
then we save on the raw ledger on consensus side, raw ledger on fabric peer side, we only have a validated ledger, we don't have to validate signatures in the orderer, because the peer already does so, and reconfiguration is just a headache once and not twice

simon
2016-09-28 09:40
and retain the submitting peer so that you don't have to propagate the configuration change to all clients (sdk) as well

simon
2016-09-28 09:42
and i don't know what these channels are supposed to be. are they sidechains?

kostas
2016-09-28 09:42
Effectively, yes.

simon
2016-09-28 09:43
but they're not being hooked into the main chain?

simon
2016-09-28 09:43
and they're also not designed from first principles

garisingh
2016-09-28 10:00
once again, maybe we need to go back and define the purpose of the ordering service and basic features it needs to support - aka first principles

simon
2016-09-28 10:01
yes

simon
2016-09-28 10:02
because right now it looks like the ordering service is its own blockchain, and then there is a second blockchain that interprets the first one

garisingh
2016-09-28 10:02
What is the ordering service? What features does it need to support? How do "clients" interact with it? etc RIGHT - and I do not like that part one bit

simon
2016-09-28 10:02
but they're not in the same process, yet they need to share a lot of configuration

garisingh
2016-09-28 10:02
personally I don't think that they need to have this tight coupling of sharing

simon
2016-09-28 10:03
it may be better to follow what everybody else seems to be doing, which is combine app+consensus+storage in one process

garisingh
2016-09-28 10:03
I think that the ordering service needs to provide "meta information" for its "clients" to use and those clients can decide what they want to do with that information

simon
2016-09-28 10:04
look at reliable delivery, for example

garisingh
2016-09-28 10:04
actually, more people are moving away from that - Tendermint, Axoni, etc

simon
2016-09-28 10:05
as a client submitting something to bft, i need to connect to at least f+1 consensus nodes

garisingh
2016-09-28 10:05
agreed

simon
2016-09-28 10:05
so i need to know the whole set of consensus nodes (which can change)

garisingh
2016-09-28 10:05
agreed

simon
2016-09-28 10:05
so how do i do this?

simon
2016-09-28 10:05
i can't go and ask one bootstrap node "what other nodes are there?"

garisingh
2016-09-28 10:06
BTW - that's exactly what we do today :disappointed:

garisingh
2016-09-28 10:06
in the current fabric

simon
2016-09-28 10:07
yep

simon
2016-09-28 10:07
i know

simon
2016-09-28 13:18
@garisingh: i just talked with @vukolic

simon
2016-09-28 13:19
it seems to me that the ordering service is really *the* blockchain implementation

simon
2016-09-28 13:19
and the fabric peer is an application server that uses this blockchain

simon
2016-09-28 13:39
i think that is important to realize

jyellick
2016-09-28 14:20
@simon Completely agree with that assessment

jyellick
2016-09-28 14:27
Ordering service builds a blockchain, peer network runs application logic which uses the blockchain in some interesting way. It so happens they may also choose to write this application logic output into another blockchain structure (the validated chain), but as we've pointed out before, that's not strictly necessary and is really only a tool to help with auditing

jyellick
2016-09-28 14:43
Assuming there's agreement that the orderer network configuration should be on the chain, and that the users of the orderer network should read this data to know how to validate what is returned from the ordering service. How do we specifically convey this information (ie, what datastructure)? The three options I see:
1. Use some orderer specific crafted data structure, modifying the peer to understand this new type
2. Define a new fabric transaction type, and embed some orderer specific data structure, modifying the peer to understand this new type
3. Utilize the existing fabric transaction type, modifying the orderer to understand this existing type

jyellick
2016-09-28 14:52
As I see the pros/cons:
1. The type will be minimally complex and will require no knowledge of the fabric introduced into the orderer, but will push complexity into the peer and any tools which want to consume the chain; concepts like versioning and signatures would need to be re-invented
2. The type will be nearly minimally complex and will require very limited knowledge of the fabric introduced into the orderer; complexity is pushed into the peer, but it's more obvious for tools as the type is at least defined; concepts like versioning and signatures would need to be re-invented
3. The type is very complex and requires some implementation specific knowledge of the fabric introduced into the orderer, but complexity is very low at the peer and simple for tools operating on the chain; the format is already well thought out to be cryptographically correct, supports versioning, and already has a signature scheme

jyellick
2016-09-28 14:57
Also open to opinions that my assumptions are entirely invalid. Or other options I failed to include. @cca @simon @vukolic @garisingh @jeffgarratt @sanchezl @tuand @elli @kostas ^^^

elli
2016-09-28 15:02
Hi, for either option i would say that such transactions would need to be authenticated (through signatures coming from a threshold of orderers or some ordering service administrator), no?

jyellick
2016-09-28 15:03
The ordering service should never deliver an invalid configuration transaction, so I actually think it's okay to punt on this for now.

elli
2016-09-28 15:03
That is if such ordering service messages have to do with changes in the configuration of the orderers, including adding/deleting members, the transactions that cause this to happen should be authenticated as an ordering service policy imposes, no?

jyellick
2016-09-28 15:04
Ultimately though, the ordering service will need to be able to authenticate a valid reconfiguration transaction, which would require signature validation. The key distinction being if the peer sees a reconfiguration transaction on the raw chain, it knows the consensus service agrees with it. If the consensus service sees a reconfiguration transaction, it can simply discard it, for now (aside from genesis).

jyellick
2016-09-28 15:05
Right, I think you have it absolutely correct @elli the signature scheme is for the orderers to validate it, not the peers.

elli
2016-09-28 15:05
Let me rephrase the justification of why signatures may be needed.

elli
2016-09-28 15:06
To my understanding there should be some entity or entities that are authorized to reconfigure the network.

elli
2016-09-28 15:08
That is either an ordering service admin, or the orderers jointly. Now in the first case, the first one who submits a transaction to the orderers needs to be authenticated. No? Meaning that orderers should not accept such transactions if they come from anyone.

elli
2016-09-28 15:09
Or i should ask, how would you see the reconfiguration taking place?

elli
2016-09-28 15:09
is it the case that the admin reconfigures all peers that a change CHANGE should take place, and each orderer tries to submit a transaction that reflects that CHANGE

elli
2016-09-28 15:10
then if CHANGE is in the queue of tasks to be done by the other orderers then they all agree. Is this how it would be done?

jyellick
2016-09-28 15:10
Orderers must all start with the same configuration, or they cannot form a network. This configuration is encoded as a transaction and embedded in the genesis block, and because this configuration was manually propagated by an administrator to each orderer, no signature check is really needed. If the administrator wanted to be malicious, they could change keys etc. Eventually, a live orderer network will need to be able to reconfigure, and in this case, yes, there must be some way to validate that the instruction is valid according to whatever policy the orderer service wants. I would think a threshold of signatures from whatever entities control the network. However, since we don't need to support reconfiguration out of the gate, we can simply have a policy of "no new reconfiguration transaction is valid", in order to simply get the network up and running. This prevents a malicious peer from forcing a reconfiguration of the network.
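The bootstrap policy jyellick describes can be sketched as follows (the `ConfigTx` type, its `Seq` field, and `validateConfig` are hypothetical illustrations, not fabric code): the genesis configuration, distributed out of band by the administrator, is accepted without signature checks, and every later reconfiguration transaction is rejected until a real signature policy exists.

```go
package main

import "fmt"

// ConfigTx is a hypothetical stand-in for an orderer configuration
// transaction; Seq 0 denotes the genesis configuration.
type ConfigTx struct {
	Seq int
}

// validateConfig implements the interim policy "no new reconfiguration
// transaction is valid": genesis is trusted (it was manually propagated by
// the administrator), everything else is refused, which also stops a
// malicious peer from forcing a reconfiguration of the network.
func validateConfig(tx ConfigTx) error {
	if tx.Seq == 0 {
		return nil // genesis: trusted out of band
	}
	return fmt.Errorf("reconfiguration not supported: rejecting config tx seq %d", tx.Seq)
}

func main() {
	fmt.Println(validateConfig(ConfigTx{Seq: 0})) // genesis accepted (nil)
	fmt.Println(validateConfig(ConfigTx{Seq: 1})) // later reconfig rejected
}
```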

elli
2016-09-28 15:11
Aha, ok

elli
2016-09-28 15:11
then agreed :slightly_smiling_face:

elli
2016-09-28 15:11
But then why would you need the options 1-3?

elli
2016-09-28 15:12
Is it only to say "how would we express this static configuration"?

jyellick
2016-09-28 15:13
Correct. Today it would be to express static configuration, however, because we anticipate dynamic configuration in the future, I thought we should pick or design the datastructure in anticipation of that

jyellick
2016-09-28 15:13
It seems odd and arbitrary to have different datastructures for initial (static) and later dynamic configuration specification

elli
2016-09-28 15:13
Ah then, for dynamic reconf you would need authenticated messages no?

jyellick
2016-09-28 15:13
Yes

elli
2016-09-28 15:15
So, i would just add that for (1) and (2) one would need to do a lot of the work already done on (3)

jyellick
2016-09-28 15:15
Yes, the thing that I like about (3) is exactly that we don't have to do that work. The thing I do not like about (3) is that it pulls in a lot of other fabric artifacts, like the MVCC+Postimage data model

elli
2016-09-28 15:15
Also, addition and/or removal of CAs of the endorsement network (if this is still a valid statement) would need to be communicated to the orderers, and changes to the orderer config would need to be understood by the committers, no?

elli
2016-09-28 15:16
aha

jyellick
2016-09-28 15:17
Yes, I was not including those in this scope, but maybe it makes sense to.

jyellick
2016-09-28 15:18
The feeling I am getting is that maybe we need to modify our transaction definition, to allow transactions without the strict MVCC+Postimage data model, so that we can retain the signature validation, versioning, crypto correctness, etc., but not have to pull all of the fabric pieces in

elli
2016-09-28 15:18
for the ESCC and VSCC of the ordering service

elli
2016-09-28 15:18
cant it be something that requires only one signature, e.g., client signature?

elli
2016-09-28 15:18
client in this case would be the admin

jyellick
2016-09-28 15:19
Today the ordering service has no *SCC, because that would be pulling in pieces of the fabric we were hoping not to

elli
2016-09-28 15:19
and move all the execution to the vscc side?

elli
2016-09-28 15:19
would it make sense to run a client only?

elli
2016-09-28 15:20
on the ordering service (admin side)

elli
2016-09-28 15:20
that talks to the endorser nodes of the blockchain the ones that endorse system transactions

elli
2016-09-28 15:20
understand the issue

elli
2016-09-28 15:23
However, i do see, that even if you dont want to call it vscc, there has to be some small piece of code that would parse the tx-s meant for the ordering service, and which would decide if these tx-s are valid or not

elli
2016-09-28 15:24
updates of certificates of the admin, or of orderer certificate validity are all operations that need to take place at the same logical time for all orderers no?

jyellick
2016-09-28 15:24
Yes, that is the thing, it all looks very much like VSCC and assorted other *SCC tasks. Same logical time yes, but that is easy since they are doing consensus

elli
2016-09-28 15:25
So if you have some sort of super light weight VSCC that would still work right, and no ESCC

elli
2016-09-28 15:26
specifically for these transactions (the only ones processed by orderers)

jyellick
2016-09-28 15:28
Right

jyellick
2016-09-28 15:30
Do you think it would be possible to modify the transaction type so that it preserves the aspects you are concerned about without pulling in the fabric details?

elli
2016-09-28 15:33
could be yes

simon
2016-09-28 15:33
seems we're doing everything twice

simon
2016-09-28 15:34
if the orderer was part of the peer, would you just look at some system table for consensus?

simon
2016-09-28 15:35
i don't quite understand the difference between (1) and (2)

jyellick
2016-09-28 15:37
Yes, I too am concerned we are doing everything twice.

jyellick
2016-09-28 15:38
The difference between (1) and (2) is basically that the fabric defines a `Transaction` which basically has two fields: `type int` and `contents []byte` (names might be wrong, but this is the idea)

jyellick
2016-09-28 15:38
Today, the fabric only supports `type = 0` but it is there if we ever wanted to support a radically different transaction type

jyellick
2016-09-28 15:38
So, we could define a new `type = 1` with a different payload which is not the `endorsed proposal`

jyellick
2016-09-28 15:39
So (1) would be, we define a new data type which is not a type of `Transaction`, and (2) would be we define a new `Transaction` type. They are extremely similar, just whether we reuse the existing envelope
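
The two-field envelope described above can be sketched roughly as follows (illustrative Python, not the actual fabric protobufs; the type constants and handler names are hypothetical). The point is that a new `type = 1` reuses the same envelope, with the payload staying opaque until dispatch:

```python
# Sketch of the `type int` / `contents []byte` envelope described above.
# Not fabric code; constants and handlers are illustrative assumptions.
from dataclasses import dataclass

TX_ENDORSED = 0   # today's only supported type (endorsed proposal)
TX_CONFIG = 1     # hypothetical new type for (re)configuration payloads

@dataclass
class Transaction:
    type: int
    contents: bytes   # opaque; its interpretation depends entirely on `type`

def dispatch(tx: Transaction) -> str:
    # Route the opaque payload based on the envelope's type tag.
    handlers = {
        TX_ENDORSED: lambda c: "endorsed proposal",
        TX_CONFIG:   lambda c: "config update",
    }
    handler = handlers.get(tx.type)
    if handler is None:
        raise ValueError(f"unknown transaction type {tx.type}")
    return handler(tx.contents)
```

Under this sketch, option (2) is just "register one more entry in the dispatch table" rather than inventing a parallel data structure.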

simon
2016-09-28 15:40
right, but a consensus-admin-endorsed consensus configuration setting

simon
2016-09-28 15:41
which has a transparent part: peers (addresses, certificates), byz F, and it has an opaque part which is specific to the consensus implementation

jyellick
2016-09-28 15:41
Right

simon
2016-09-28 15:42
seems fine

simon
2016-09-28 15:42
and the correctness of this is authenticated towards the peer because consensus signed off on it

simon
2016-09-28 15:42
jyellick: i just talked with @vukolic about state transfer in sbft

simon
2016-09-28 15:43
and how it probably can be implemented via consensus client

jyellick
2016-09-28 15:43
@jeffgarratt Really believes that the peer network is going to want to see endorsements from its endorsers on the orderer reconfiguration. I think that's not realistic, but, be aware that there is support for this.

elli
2016-09-28 15:43
Additional note: for enforcing access control over different channels, e.g., adding or removing listeners on a specific channel, orderers would still need to process some form of transactions that the fabric understands, no?

simon
2016-09-28 15:43
i.e. connect to other atomicbroadcast server to retrieve missing blocks

jyellick
2016-09-28 15:43
Agreed

jyellick
2016-09-28 15:43
That would have been my first instinct

simon
2016-09-28 15:44
jyellick: then just add their signatures into the multisig

jyellick
2016-09-28 15:45
But I still don't have a literal data structure and ingress method for reconfiguration. Maybe you can just answer some questions.... 1. Does reconfiguration come in as a broadcast message?

simon
2016-09-28 15:46
i gotta run, so i'll revisit this later

simon
2016-09-28 15:46
i guess in the model yes

simon
2016-09-28 15:46
when we share it with the peer

jyellick
2016-09-28 15:47
Okay, understand you need to run. Please think on this. I will try to think on what you've said as well and we can pick this up tomorrow?

jyellick
2016-09-28 15:52
@elli I have been doing my best to push channels out of my head, but yes, there will need to be some mechanism for reconfiguring them. I don't know whether we should build support for channels into the atomicbroadcast api, or whether we should simply create a new service which wraps it and handles those details. Channels are a sudden piece of significant additional complexity on the ordering side

elli
2016-09-28 15:52
Indeed :slightly_smiling_face:

vukolic
2016-09-28 16:12
I am inclined to avoiding state transfer in sbft

vukolic
2016-09-28 16:13
We do not need it, as there is no state that cannot be retrieved from a single block

vukolic
2016-09-28 16:14
I should not drive and slack... Leads to typos

vukolic
2016-09-28 16:18
Catching up boils down to obtaining a weak checkpoint certificate and picking up view number and sequence number from there

vukolic
2016-09-28 16:20
From that point on filling in the gap can be done lazily, in the background

vukolic
2016-09-28 16:20
Sort of a 'lazy state transfer'

yacovm
2016-09-28 17:17

yacovm
2016-09-28 17:18
In case someone from zurich wants to take a look at it

jyellick
2016-09-28 18:51
@yacovm This code path should be dying in v1, hopefully not worth looking at, we can add a `t.Skip` if it is causing problems

yacovm
2016-09-28 18:59
ok. It failed my build so I was just playing a concerned fabric citizen

cca
2016-09-28 19:30
@jyellick: let me disregard "channels" for now, only look at what is in Next-Consensus-Arch doc: different chaincodes. There can be a "system chaincode" that interprets such reconfig transactions. Yes they need to be somehow signed or otherwise conform to the endorsement policy of that system chaincode. This is not the first system that has this, I invite everyone to look at those that did it before: Zookeeper (https://zookeeper.apache.org/doc/trunk/zookeeperReconfig.html) with a paper here (https://www.usenix.org/system/files/conference/atc12/atc12-final74.pdf) , not security-conscious, but otherwise similar. Or BFT-SMaRT (http://www.di.fc.ul.pt/~bessani/publications/dsn14-bftsmart.pdf and http://bft-smart.github.io/library/) in BFT-land.

cca
2016-09-28 19:32
BTW, the adapter that "receives" ordered tx at the peers outside consensus will also need to understand dynamic consenter node changes, otherwise, how could it trust this output? (= know which consenter nodes to trust). This should not be exposed to the peer because it is really specific to the impl. of consenters.

cca
2016-09-28 19:37
(And here is a 10-year old PhD thesis, in which I was a bit involved, where reconfiguration is also described: https://www.ideals.illinois.edu/handle/2142/11121 then click on PDF, see Chapter 5.)

garisingh
2016-09-28 19:48
``` BTW, the adapter that "receives" ordered tx at the peers outside consensus will also need to understand dynamic consenter node changes, otherwise, how could it trust this output? (= know which consenter nodes to trust). This should not be exposed to the peer because it is really specific to the impl. of consenters. ``` So the question is whether or not this info needs to be written to the raw ledger on the peer side. If it does, I am fine with that - but it should look like just another raw ledger "block" / whatever we call it. The "adapter" should receive the update, not process any more delivers until it reconfigures itself and updates the raw ledger with this new info. That's why I think that the config info should be delivered in a batch by itself

garisingh
2016-09-28 19:49
this way you know the point in time when the new config went into effect but basically the peer does not care anything about it

donovanhide
2016-09-28 19:52
Just following along out of interest and don’t want to distract, but is the idea that all changes to the list of trusted peers by a single peer is logged for eternity in the blockchain and all other peers can inspect those selections? If so, it’s an interesting approach, differing from Ripple, where their equivalent (the Unique Node List) is a private matter.

garisingh
2016-09-28 19:55
@donovanhide - I think what is in that config would depend on the ordering service implementation. For example, with a PBFT-based ordering service, if we want to provide some type of BFT broadcast, each peer would need to connect to at least f+1 ordering nodes to make sure it was getting the correct info. So to do that, it needs to get this info from somewhere. Initially, this would likely be a bootstrap config, but if the ordering service adds additional nodes, then f+1 might be different and the peers would need to know that

donovanhide
2016-09-28 19:58
Interesting… just to share some other ideas from Ripple, the bootstrap stage happens via round-robin DNS:
```
host r.ripple.com
r.ripple.com has address 169.53.155.44
r.ripple.com has address 54.186.248.91
r.ripple.com has address 174.37.225.50
r.ripple.com has address 54.86.175.122
r.ripple.com has address 169.55.164.22
r.ripple.com has address 54.186.73.52
r.ripple.com has address 54.84.21.230
r.ripple.com has address 184.173.45.44
```
and the protocol has means for a peer to share the peers it knows about. https://github.com/ripple/rippled/blob/906ef761bab95f80b0a7e0cab3b4c594b226cf57/src/ripple/proto/ripple.proto#L220-L255

donovanhide
2016-09-28 20:00
Co-incidentally, the very same problem is being addressed concurrently: https://github.com/ripple/rippled/pull/1842

garisingh
2016-09-28 20:00
we can do that as well - but the question is how can you trust that a single peer provides you with the right list? :wink:

cbf
2016-09-28 20:00
bwahahaha

garisingh
2016-09-28 20:00
but I hear ya

garisingh
2016-09-28 20:00
I don't think we need to solve world hunger the first time around

garisingh
2016-09-28 20:01
with permissioned networks honestly there is a level of trust

donovanhide
2016-09-28 20:02
Well, the Ripple approach initially used the Anonymous Diffie Hellman cipher in openssl to stop sybil attacks, but I don’t think I’ve got the stomach to go into the detail of all that :slightly_smiling_face:

donovanhide
2016-09-28 20:03
I like the idea of making public which peers trust which peers, and cementing it though. Makes for a very transparent system.

yacovm
2016-09-28 20:03
how can you have a sybil attack? we have the membership service that everyone subscribes to

donovanhide
2016-09-28 20:04
How do you know the membership service is who you think it is? :slightly_smiling_face:

yacovm
2016-09-28 20:04
it's a CA

yacovm
2016-09-28 20:04
the same way your browser trusts verisign

donovanhide
2016-09-28 20:04
Just one?

yacovm
2016-09-28 20:05
browsers work in the same way don't they? they have pre-installed certificates

yacovm
2016-09-28 20:05
and you have a chain of trust from some "important CA" down to other smaller CAs

donovanhide
2016-09-28 20:06
The point, I guess, is if you have just one certificate sitting in one place, then that could be a SPOF. But I was really more interested in following along the consensus discussion :slightly_smiling_face:

yacovm
2016-09-28 20:08
consensus or membership?

donovanhide
2016-09-28 20:08
Well, the dynamic nature of the peers’ trusts involved in consensus.

yacovm
2016-09-28 20:10
but why is that? if you have a membership service (you do, in HL) that gives peers signed public/private key pairs that can't be forged, you can't impersonate a peer, nor can you spoof a peer that isn't registered. Where does the trust of the peers come into play here?

kostas
2016-09-28 20:11
Can I trust peer A when it says C and D are in the list, when it's actually B and E?

yacovm
2016-09-28 20:11
in the list of what?

donovanhide
2016-09-28 20:11
Well, say I’m Bank A, and Bank B has started doing something shady, and I want to stop them doing shady stuff on our shared network as soon as possible.

kostas
2016-09-28 20:12
In the list of orderers.

yacovm
2016-09-28 20:12
oh, you mean membership information about roles of entities?

yacovm
2016-09-28 20:12
who's a consensus peer and who's not?


yacovm
2016-09-28 20:15
I didn't think about it for too long but I'd go with the following approach: You have a consensus service which consists of membership set S_0. A peer comes and asks to join. The peers of S_0 run a consensus algorithm that results in the "next" membership set S_1, and by induction S_i yields S_{i+1}, etc. If the consensus is byzantine tolerant, you've solved the problem. The only problem with that is bootstrapping, which needs to be addressed in another way
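
The inductive membership scheme above can be sketched like this (illustrative Python; the >2/3 quorum rule and the vote plumbing are assumptions, not anything fabric actually implements):

```python
# Sketch of consensus-approved membership transitions S_i -> S_{i+1}.
# Assumption: a join is accepted only if more than 2/3 of the *current*
# membership set votes for it (a typical BFT-style quorum).

def next_membership(current: set, candidate: str, votes: set) -> set:
    """Return S_{i+1}: `current` plus `candidate` iff a quorum approved."""
    approvals = votes & current                 # only current members may vote
    if 3 * len(approvals) > 2 * len(current):   # strict > 2/3 quorum
        return current | {candidate}
    return current                              # not enough support: unchanged

S0 = {"A", "B", "C", "D"}
S1 = next_membership(S0, "E", votes={"A", "B", "C"})  # 3 of 4 approve -> joins
```

Bootstrapping S_0 itself is exactly the part this sketch cannot cover, as noted above.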

yacovm
2016-09-28 20:15
oh that's what Gari said (the bootstrapping)

donovanhide
2016-09-28 20:17
I guess the fun question occurs when a peer only wants to trust a subset of the other peers. That’s the problem that Ripple and Stellar tried to solve on a public network, but maybe that doesn’t apply here...

kostas
2016-09-28 20:19
I honestly don't see the big problem but then again I am fine with relaxing our assumptions and being practical. (As you have pointed out we have a CA in here already.) Bake that bootstrap list in the peers, and ship a new binary/genesis block when this list needs to be updated. (For any *new* nodes that want to join, and when the original bootstrap set has zero overlap with the new one.)

yacovm
2016-09-28 20:20
I think that hyperledger is "complicated" enough as it is without having to tackle bootstrapping trust issues

kostas
2016-09-28 20:21
It's not like these entities transacting on the network meet once in 2016 and never ever again talk to each other, or have lost the ability to coordinate manually if need be.

garisingh
2016-09-28 20:35
they probably won't meet until 2017 :wink:

garisingh
2016-09-28 20:35
maybe around March?

donovanhide
2016-09-28 20:46
Can a user have more than one Transaction Certificate active at the same time?

vukolic
2016-09-28 23:03
@garisingh yes - once we have the consensus reconfiguration the membership change will be written to raw ledger

vukolic
2016-09-28 23:05
As for the initial consensus service discovery - I always imagined a genesis block in which initial set of consenters (and other bootstrap) info is written

vukolic
2016-09-28 23:05
this genesis block has a hash

vukolic
2016-09-28 23:05
which is the identifier of the blockchain instance

vukolic
2016-09-28 23:07
so nodes join a specific blockchain by downloading (from anywhere) a genesis block and comparing its hash to the blockchain identifier
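
That bootstrap check is small enough to sketch directly (illustrative Python; the choice of SHA-256 and the serialized genesis contents are assumptions):

```python
# Sketch of genesis-block bootstrap: the hash of the genesis block *is*
# the chain identifier, so a joining node can download the block from any
# untrusted source and verify it locally against the known identifier.
import hashlib

def chain_id(genesis_block: bytes) -> str:
    # Assumption: SHA-256 over the serialized genesis block.
    return hashlib.sha256(genesis_block).hexdigest()

def verify_genesis(downloaded: bytes, expected_chain_id: str) -> bool:
    # A tampered or wrong genesis block simply fails the hash comparison.
    return chain_id(downloaded) == expected_chain_id

genesis = b"consenters: A,B,C,D; bootstrap-info: ..."
cid = chain_id(genesis)   # published out-of-band as the blockchain identifier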

garisingh
2016-09-29 00:37
@donovanhide: Yes. You can request transaction certificates in batches. You can choose to use a different one for every transaction or you could keep a few in flight. Of course they will expire at some point as well

grapebaba
2016-09-29 03:14
seems most like Kafka's architecture: producer, consumer, broker all distributed separately

kostas
2016-09-29 03:22
Right but keep in mind that we are working on a different set of trust assumptions here. For instance in Kafka, as long as one of the brokers you are connected to is up and running, you're good to go, even if the other brokers in your config have died. (That one broker will let you know of the new broker set.) This won't fly in the BFT case.

vukolic
2016-09-29 12:00
@kostas if we map Kafka nodes to consenters - what is the mapping

vukolic
2016-09-29 12:00
1 ZK server per consenter that I suppose will be the case

vukolic
2016-09-29 12:00
but what about brokers?

simon
2016-09-29 13:40

kostas
2016-09-29 13:53
@vukolic: The set of nodes in our ordering service maps to the set of Kafka brokers that are replicating a partition (that partition's leader and that partition's followers). Producers of this system should use `acks=all`, `unclean.leader.election.enable` for the broker should be set to `false`, and `min.insync.replicas` for the broker should definitely be > 1 as well.
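
Collected as a config fragment, the settings above would look roughly like this (a sketch; the exact `min.insync.replicas` value depends on the chosen replication factor):

```
# producer-side
acks=all

# broker-side
unclean.leader.election.enable=false
min.insync.replicas=2   # must be > 1; pick per your replication factor
```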

kostas
2016-09-29 13:54
> 1 ZK server per consenter that I suppose will be the case

kostas
2016-09-29 13:56
The number of ZooKeeper servers does not (and should not) equal the number of brokers. 3/5/7 ZK servers is all you need, depending on the fault tolerance you wish to have.
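
The arithmetic behind the 3/5/7 guidance: a ZooKeeper ensemble needs a majority quorum, so n servers tolerate floor((n-1)/2) crashes. A quick sketch:

```python
# Majority-quorum fault tolerance for a ZooKeeper ensemble of n servers.
# This is also why vukolic's later scenario (3 faults hitting a 5-server
# ensemble) loses liveness: 5 servers only tolerate 2 ZK faults.

def zk_fault_tolerance(n_servers: int) -> int:
    return (n_servers - 1) // 2   # crashes survivable while keeping a majority

tolerances = {n: zk_fault_tolerance(n) for n in (3, 5, 7)}
```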

kostas
2016-09-29 13:56
That said, each broker creates a ZK _ephemeral_ node when it's created.

kostas
2016-09-29 13:56
Let me know if there are any more questions.

simon
2016-09-29 13:59
i have no idea what any of this means

simon
2016-09-29 13:59
but i'm happy that we have support for a well tested system

simon
2016-09-29 14:01
do we have a scrum today?

kostas
2016-09-29 14:02
Dialing in to the scrum now.

yacovm
2016-09-29 14:03
seems so

kostas
2016-09-29 15:46
Someone who's better in Docker-land than I am (@jeffgarratt?). Is it unrealistic for me to expect the `ORDERER_KAFKA_BROKERS` ENV var for `orderer0` to be parsed as `IP of Kafka:9092` in the snippet below?


kostas
2016-09-29 15:47
Note that the ORDERER_KAFKA_BROKERS ENV var is meant to be a slice, which is why it's written that way.

jeffgarratt
2016-09-29 15:51
@kostas yes, it will be read as is

jeffgarratt
2016-09-29 15:51
but that host should resolve to the proper IP

jeffgarratt
2016-09-29 17:40
can you hear me?

tuand
2016-09-29 17:40
what @jeffgarratt ?

jeffgarratt
2016-09-29 17:41
sorry, wrong channel :slightly_smiling_face:

lory
2016-09-30 03:17
has joined #fabric-consensus-dev

shannon_wie
2016-09-30 05:56
has joined #fabric-consensus-dev

vukolic
2016-09-30 09:25
@kostas if this is the case (not every ZK server is consenter) - we have to be careful about how we "sell" the scalability of Kafka-orderer

vukolic
2016-09-30 09:25
strictly speaking if you have 100 brokers but only 5 ZK servers

vukolic
2016-09-30 09:25
your kafka-orderer will not be live with 3 consenter faults

vukolic
2016-09-30 09:26
if those happen to be ZK server faults

vukolic
2016-09-30 09:26
perhaps not a huge problem but just mentioning

frankyclu
2016-10-01 09:32
has joined #fabric-consensus-dev

frankyclu
2016-10-01 09:37
hey guys, not sure if this is already known, but this error is fairly easy to trigger in the PBFT part of fabric 0.6

frankyclu
2016-10-01 09:44

garisingh
2016-10-01 10:01
Hi @frankyclu - do you mean it is fairly easy to force a peer to be out of sync?

frankyclu
2016-10-01 10:40
sorry, I was gonna add more detail but I didn't know you get up this early :slightly_smiling_face:

frankyclu
2016-10-01 10:43
the panic happens (which will basically shut down the node) after this sequence: one peer gets out of sync due to a network connection error; then, as the peer tries to sync up (at this point the rest of the network has already exceeded the peer's high watermark), it also gets chat messages from the bootstrap peer (which then generate *error: Peer FSM cannot handle message (DISC_GET_PEERS) with payload size (0) while in state: created* every few seconds). If I stop and start the peer (which will usually get rid of the "FSM cannot handle message" error), the network will then attempt a view change because they think the problem peer is faulty, however it will just keep on attempting for several minutes w/o any results

frankyclu
2016-10-01 10:59
at this point if you start another wave of transactions, the problem peer will go panic @garisingh

frankyclu
2016-10-01 11:01
I think this problem will be common once the peers get deployed into physically separated VMs

yacovm
2016-10-01 11:52
Panic? Can you put here the stack trace?

frankyclu
2016-10-01 12:21
@yacovm it's pasted above

yoshihara
2016-10-01 12:29
has joined #fabric-consensus-dev

yacovm
2016-10-01 12:40
oh, didn't notice (was using slack on android)

garisingh
2016-10-01 18:19
sorry - had to run out this morning

garisingh
2016-10-01 18:21
the code actually exits with a panic - so this is intentional from that perspective. The peer has found itself in a state where it should not continue to participate. But I would think that if you then restart that peer, it should detect that it's out of sync and then initiate state transfer

kostas
2016-10-01 21:15
> the network will then attempting a view change request because they think the problem peer is faulty, however it will just keep on attempting for several minutes w/o any results

kostas
2016-10-01 21:16
This should probably be "the peer will then attempt a view-change request", right?

kostas
2016-10-01 21:18
If that's the case, note that the peer won't actually join and actively participate in the network until the network (eventually) switches its view to the one that this complaining peer wanted all along. (It may take a while, or even forever.)

kostas
2016-10-01 21:40
Until that happens however, note that this complaining peer will have its state synced even though it doesn't participate in ordering. (Long and somewhat convoluted explanation of how that happens: It will be able to identify any weak checkpoint sets above its high watermark, mark itself as out of date, and move its low watermark accordingly. Then, upon receiving the next weak checkpoint cert, it will state transfer to it. This process will repeat periodically, assuming the rest of the network progresses normally.)

kostas
2016-10-03 07:49
@vukolic I had missed this message, sorry. Yes, this is a good observation. If you no longer have a majority quorum in your ZK ensemble and, say, the partition leader crashes, you'd be in trouble.

ckeyer
2016-10-03 07:53
has joined #fabric-consensus-dev

kostas
2016-10-03 09:17
Given that we're asked to post updates here --

kostas
2016-10-03 09:17
I finished the work on the kafka-orderer this past Friday https://github.com/kchristidis/fabric/tree/kafka-orderer-complete/orderer (https://jira.hyperledger.org/browse/FAB-32?focusedCommentId=19084 - will post the changeset once the Vagrant image gets upgraded to Go 1.7)

kostas
2016-10-03 09:17
Was also asked how Kafka deals with reconfiguration, I posted about this here: https://jira.hyperledger.org/browse/FAB-496?focusedCommentId=19092

kostas
2016-10-03 09:17
Up next: review the SBFT changeset

simon
2016-10-03 09:18
yey

simon
2016-10-03 09:18
\o/ i get a review

kostas
2016-10-03 09:23
Sorry, yeah, I was planning to get to it last week but I bumped into some misconfiguration issues with the Kafka/ZooKeeper images (that only manifested themselves after I tried to test via Docker Compose in Vagrant)

simon
2016-10-03 09:45
no worries

simon
2016-10-03 09:47
i'm trying to figure out what to do with backlog messages and reconnect events

simon
2016-10-03 09:47
ah suddenly it seems to make sense

simon
2016-10-03 11:14
@elli: right now i store a list of blobs to represent the signatures on a block

simon
2016-10-03 11:14
@elli: do i also have to store some kind of identification for the signature, so that you can test it against a key?

simon
2016-10-03 11:14
how is this done usually?

elli
2016-10-03 11:19
Hi, @simon : yes, this is correct; one would need to include the certs (or references to the certs) of the users who are signing.

elli
2016-10-03 11:19
users/nodes.

simon
2016-10-03 11:24
the consenter, in this case

simon
2016-10-03 11:24
thanks, somehow i missed that

simon
2016-10-03 11:24
what is a customary way of doing this?

simon
2016-10-03 11:25
attach the cert or the cert fingerprint?

simon
2016-10-03 11:25
do you prefer protobuf or asn1?

elli
2016-10-03 12:34
So it depends. A cert fingerprint could suffice if every peer who is supposed to evaluate the signatures is already in possession of the certs.

simon
2016-10-03 12:34
ah i see

elli
2016-10-03 12:34
If e.g., certs of valid orderers are announced through the blockchain (since i am guessing you refer to orderer signatures) , then a fingerprint would suffice.
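
The fingerprint scheme can be sketched as follows (illustrative Python; hashing the cert's DER bytes with SHA-256 is an assumption, as are all the names, and the "cert" values stand in for real X.509 DER blobs):

```python
# Sketch: block signatures carry only a short cert fingerprint, which the
# verifier resolves to a full cert previously announced on the blockchain.
import hashlib

def fingerprint(cert_der: bytes) -> str:
    # Assumption: SHA-256 over the cert's DER encoding.
    return hashlib.sha256(cert_der).hexdigest()

# Lookup table built from orderer certs announced through the blockchain.
announced = {fingerprint(c): c for c in (b"cert-orderer-0", b"cert-orderer-1")}

def resolve_signer(fp: str):
    # Returns the full cert for a fingerprint, or None if never announced.
    return announced.get(fp)
```

This keeps each signature entry small (one hash instead of a full cert) at the cost of requiring the announcement channel to exist first, which matches the discussion above.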

simon
2016-10-03 12:35
yes

simon
2016-10-03 12:35
yea they have to be announced

elli
2016-10-03 12:36
re: protobuf or asn1 i would invite @adc to the discussion.

adc
2016-10-03 12:37
has joined #fabric-consensus-dev

elli
2016-10-03 12:38
I would say ASN1 but it makes sense that we are consistent with the rest of signatures produced.

simon
2016-10-03 12:38
yea

simon
2016-10-03 12:39
probably all structures that should be stored would be better in ASN1

simon
2016-10-03 12:39
but not my decision

alankhlim
2016-10-03 13:14
has joined #fabric-consensus-dev

simon
2016-10-03 13:47
@jyellick: so that's the difference between batch and block: batch refers to replica ids, block contains replica certificates/fingerprints

jyellick
2016-10-03 14:20
@simon I think that depends on which 'batch' you're referring to, the batches emitted from the ordering service are blocks (only called batches to differentiate them from the validated blocks), it could be we want additional data around a PBFT batch, though if we could avoid additional fields reusing the block structure seems preferable

jyellick
2016-10-03 14:22
FYI, I tagged you on this, https://gerrit.hyperledger.org/r/#/c/1361/ but have not seen any feedback from you on it. It seems many people do not want to use a Merkle tree for the block contents hash, which is fine by me, but I know you picked Merkle tree for your impl, so wasn't sure if you had some other reason for it

tuand
2016-10-03 14:26
posting my updates as Kostas is doing earlier today ... added default endorser and validator system chaincodes to feature/convergence ... these don't do much right now except to help get the end-to-end skeleton going ... we will be adding more capabilities as we get v1 going https://gerrit.hyperledger.org/r/#/c/1367/

cca
2016-10-03 15:07
@simon, @mvu, @tuand, @kostas, everyone: Have a look at the message formats posted here -- https://hyperledgerproject.slack.com/files/adc/F2JKXGXEU/protobufmessagesandflow.pdf


tuand
2016-10-03 15:29
@cca `@mvu` did not resolve to a slack user

cca
2016-10-03 15:50
@vukolic

jyellick
2016-10-03 16:10
@cca I've been looking at those as they have been in development, @elli I do not see a date, is that newer than what I have last seen?

elli
2016-10-03 16:37
Hi, @vukolic: it should be, as we just completed it today :slightly_smiling_face:

jyellick
2016-10-03 16:38
@elli is there a place we can provide feedback on them?

vukolic
2016-10-03 16:38
There should be a JIRA issue IMO

elli
2016-10-03 16:43
fabric-crypo channel?

elli
2016-10-03 16:44
crypto*

elli
2016-10-03 16:44
there is also a jira item indeed, but i think the fabric-crypto channel is easier if you want more ppl seeing it.

yaoguo
2016-10-03 16:52
has joined #fabric-consensus-dev

simon
2016-10-03 16:57
elli: in short, what validation steps does the orderer have to perform?

simon
2016-10-03 16:58
and over what is the signature

elli
2016-10-03 16:58
How would you define short? :smile:

simon
2016-10-03 16:59
which fields have to be validated

simon
2016-10-03 16:59
more than just check the signature?

simon
2016-10-03 16:59
does this signature field opaquely contain the signing identity (cert)?

jyellick
2016-10-03 17:07
@simon my very rough understanding from @elli is that that signature field is the signature of the identity contained in the proposalheader, assuming all proposal headers have the same identity, and potentially some more complicated scheme if they differ. I hope I am wrong, because that seems very expensive to check and complicated to me

jyellick
2016-10-03 17:09
In some ways, I actually think that what we need is a higher level crypto primitive message, something which embeds an identity and signature, as well as whatever things are required to prevent replay (timestamp/nonce/ttl, whatever), a type, and a payload

elli
2016-10-03 17:10
So do you mean what is checked by VSCC (assuming the default one)

elli
2016-10-03 17:10
?

jyellick
2016-10-03 17:10
For the orderer, which remember does not have a VSCC

elli
2016-10-03 17:10
ok, i see

jyellick
2016-10-03 17:10
The orderer is going to get a message from a client, presumably today that is a transaction, and the orderer needs to be able to tell "is this actually from a client who is allowed to submit messages"

elli
2016-10-03 17:11
and the purpose of the orderer to check signature is which exactly?

elli
2016-10-03 17:11
is it DoS related?

jyellick
2016-10-03 17:11
Because each orderer can validate the connections from clients via TLS easily enough

jyellick
2016-10-03 17:11
However, because we allow for byzantine orderers, one orderer (say the primary) might lie and say "Yes, this transaction came from a client whose TLS cert is authorized" but instead make up transactions

jyellick
2016-10-03 17:12
Of course they would be filtered out at the peer side, but the byzantine primary could essentially stop network progress by making up nonsense

elli
2016-10-03 17:13
aha

cca
2016-10-03 17:13
@jyellick: that risk of a faulty primary could exist, but for the use cases we mostly look at (consortium), it seems irrelevant. That is, building in a defense against it that is always executed slows things down unnecessarily

elli
2016-10-03 17:13
but then one could add in the transaction message the certificate of the client, that would enable the orderer to do the check easily

elli
2016-10-03 17:13
but VSCC would need to do the comprehensive checks

elli
2016-10-03 17:13
that we discuss in the charts

jyellick
2016-10-03 17:14
@cca This was a concern from @vukolic I just replied to a note where he said that we need to do this check

cca
2016-10-03 17:14
... putting all of this together gives an overly complex result, it seems to me.

cca
2016-10-03 17:14
his note may not have considered the existence of TLS certs and the closed group

jyellick
2016-10-03 17:16
@cca Another reason I thought the messages should be signed: imagine a byzantine client of the ordering service injects malformed nonsense, but has a validated TLS cert

vukolic
2016-10-03 17:16
Orderers could check tcerts, which prevents the primary from issuing an arbitrary bogus request

vukolic
2016-10-03 17:17
However, tcerts do not prevent the primary from replaying old legitimate requests

vukolic
2016-10-03 17:17
If we had clients sign with ecerts this is a non-issue

vukolic
2016-10-03 17:17
But tcerts inherently prevent tracking clients

vukolic
2016-10-03 17:18
And we cannot store all past requests to prevent replay

jyellick
2016-10-03 17:18
@vukolic @cca I can see how the primary could make the chain increase in size rapidly, but even with replay or forgery, won't it be voted out as primary if it is not including the pending requests of the other backups?

vukolic
2016-10-03 17:20
Yes but it could do that

vukolic
2016-10-03 17:20
Plus replay

vukolic
2016-10-03 17:21
Preventing replay with tcerts is difficult

vukolic
2016-10-03 17:21
We may reason about time

vukolic
2016-10-03 17:21
Either logical or "real"

vukolic
2016-10-03 17:22
That may help...

jyellick
2016-10-03 17:22
There is the notion of time embedded within the proposalheader I believe

jyellick
2016-10-03 17:24
@elli What is the purpose of having multiple proposals embedded within one transaction? What happens if some of the proposals are valid and some are not?

simon
2016-10-03 17:24
i don't quite care about this transaction format

simon
2016-10-03 17:25
i think we should just add a signature field in the atomic broadcast `Broadcast` ingress call

simon
2016-10-03 17:25
field or argument

jyellick
2016-10-03 17:25
That does not handle replay

simon
2016-10-03 17:25
the orderer should not know anything about what it is ordering

simon
2016-10-03 17:25
then add a sequence number

vukolic
2016-10-03 17:25
If we add some time notion

vukolic
2016-10-03 17:26
Like a sequence number

vukolic
2016-10-03 17:26
That helps

jyellick
2016-10-03 17:26
Sequence number per client?

vukolic
2016-10-03 17:26
No

simon
2016-10-03 17:26
yes

simon
2016-10-03 17:26
no?

vukolic
2016-10-03 17:26
With tcerts that does not work

vukolic
2016-10-03 17:26
With ecerts that works

simon
2016-10-03 17:26
ah

simon
2016-10-03 17:26
why do we need that whole tcert business?

vukolic
2016-10-03 17:26
Anonymity...

vukolic
2016-10-03 17:27
Unlinkability

jyellick
2016-10-03 17:27
So a client will need to connect to the ordering service... using the t-cert?

vukolic
2016-10-03 17:27
N stuff

jyellick
2016-10-03 17:27
And will need to disconnect and reconnect between to attempt to prevent linking?

vukolic
2016-10-03 17:27
No clue

vukolic
2016-10-03 17:27
Previously

simon
2016-10-03 17:27
yea, that's all silly

vukolic
2016-10-03 17:27
When we had a submitting peer

vukolic
2016-10-03 17:27
This was less pronounced

simon
2016-10-03 17:28
should the peer also connect via TOR?

vukolic
2016-10-03 17:28
On a dial up

jyellick
2016-10-03 17:28
If you don't want a byzantine orderer to associate all the transactions from a single address with a single identity... probably

kostas
2016-10-03 17:28
For the reason you listed below (we'd still need to keep a giant list of all past TXs), it would still be an issue right?

vukolic
2016-10-03 17:28
From public phone booth

vukolic
2016-10-03 17:29
1 per client @kostas

vukolic
2016-10-03 17:29
Not that huge

vukolic
2016-10-03 17:29
In real life

jyellick
2016-10-03 17:29
Yes, this is what the original Castro paper suggests if I recall

vukolic
2016-10-03 17:30
Yes this is the classical approach

jyellick
2016-10-03 17:30
Then do we require that the TLS cert for the connection and the TLS signature for the message match?

simon
2016-10-03 17:32
not for clients at the moment

simon
2016-10-03 17:32
for replicas yes

simon
2016-10-03 17:33
we can still do replay protection for tcerts

simon
2016-10-03 17:33
we discard the state when the tcert expires

simon
2016-10-03 17:34
or we require a client to submit a new request directly to 2f+1 correct replicas (i.e. send it to 3f+1)

kostas
2016-10-03 17:34
(@vukolic: FWIW, I think we're dealing with the same complexity in any case. Whether you deal with 100 tcert'd TXs (that you can place in 10 buckets) versus 10 clients with 10 ecert'd TXs each, it's still the same access cost, if you do the mapping and splitting to buckets right.)

simon
2016-10-03 17:35
and we occasionally inform the primary which request (hashes) we have outstanding

jyellick
2016-10-03 17:35
@simon I really like the idea of the orderer being totally agnostic to the message contents, but am struggling with how this can mesh with reconfiguration. If the orderer configuration must be on the chain, how can the orderer treat all messages as opaque bytes?

simon
2016-10-03 17:36
that's the only message type it knows

jyellick
2016-10-03 17:37
So the orderer does inspect every message, and check whether it's a reconfiguration one or not (and if it's a reconfiguration one, it might choose to discard it if it is not valid)

simon
2016-10-03 17:37
yes

kostas
2016-10-03 17:37
That's how I thought it would work as well.

jyellick
2016-10-03 17:38
Okay, that was what I was thinking, but to me that is not 'totally opaque'

kostas
2016-10-03 17:39
It is not. The only other way is a side-channel specific for reconfiguration, but we don't want this for the reasons we have mentioned several times.

jyellick
2016-10-03 17:39
Because the peer needs to understand the reconfiguration as well, would you agree it makes sense to try to re-use the fabric transaction format for the reconfiguration message?

simon
2016-10-03 17:41
i disagree

simon
2016-10-03 17:41
that format is way complicated

jyellick
2016-10-03 17:42
This is why I would like to see the format simplified

simon
2016-10-03 17:42
also the protobufs don't look like they've been created with signing in mind

jyellick
2016-10-03 17:43
The problem I see, is that the peer wants to define all of its ledger interfaces to expect something of type fabric transaction for every slice of bytes in the block

simon
2016-10-03 17:43
i'd prefer stable pieces of data (i.e. what gets stored, not rpc messages) to be in ASN.1

jyellick
2016-10-03 17:43
Which I think seems perfectly reasonable; but it's odd to say "Everything will be of type fabric transaction, except for those that are of type orderer transaction"

kostas
2016-10-03 17:43
What exactly prevents the peer from doing a type switch on the received messages?

kostas
2016-10-03 17:43
Receive bytes, unpack, type switch.

jyellick
2016-10-03 17:44
You could, but that switch then propagates throughout the rest of the ledger interfaces

simon
2016-10-03 17:44
so AB block payloads need to contain a type field

jyellick
2016-10-03 17:45
At the end of the day, they are both transactions, and actually very similar, they both require a proposed change, and a set of signatures from those who are allowed to permit the change

simon
2016-10-03 17:45
which probably makes sense

simon
2016-10-03 17:45
because it allows you to multiplex different applications over the blockchain

simon
2016-10-03 17:45
well, i'm out

jyellick
2016-10-03 17:45
Block payloads need to contain a type field?

jyellick
2016-10-03 17:46
IE, the block gets a type? Or each of the block contents get a type?

kostas
2016-10-03 17:48
A block-level type is/seems less expensive. If the orderer creates a reconfig message, it ships it on its own block whose type is set to "config", otherwise it batches a bunch of TXs and sets the type of the block to "tran".

jyellick
2016-10-03 17:48
Interesting

kostas
2016-10-03 17:49
This is also in-line with Gari's suggestion to keep things simple and devote an entire block/batch to the reconfiguration.

jyellick
2016-10-03 17:49
To me, it seems far more natural to have a higher level Transaction message which contains a type field, and a bare minimum of security stuff and some bytes

jyellick
2016-10-03 17:50
The problem I see, is that if we go and invent a new 'reconfiguration' format, that we're going to have to re-invent the security

jyellick
2016-10-03 17:51
Maybe it's a worthwhile cost, I certainly do not like the idea of trying to generate the fabric transaction format as is, it is just too complicated

vukolic
2016-10-03 17:52
@kostas with tcerts preventing replay is O(no of reqs) and with ecerts O(no clients)

vukolic
2016-10-03 17:52
Not the same

jyellick
2016-10-03 17:52
O(no of reqs in the epoch) which at least bounds it somewhat

vukolic
2016-10-03 17:53
Whats an epoch

kostas
2016-10-03 17:53
@vukolic: I may well be missing something but in the end, whether you sign with an ecert or a tcert, it's an individual transaction you need to keep track of right?

vukolic
2016-10-03 17:53
No

jyellick
2016-10-03 17:53
I know they talked about adding an explicit epoch field, but here I am meaning it to be the window of time for which the t-cert is valid

kostas
2016-10-03 17:53
eCerts dictate how you do buckets (one bucket per client) but nothing prevents you from having an efficient scheme for tCerts as well.

vukolic
2016-10-03 17:53
With ecerts you keep the last one for each client

kostas
2016-10-03 17:53
Ah, let me know what I'm missing then.

vukolic
2016-10-03 17:54
You keep a ts per client

jyellick
2016-10-03 17:54
(Assuming t-certs expire after 4 hours, you only need to search over however many t-certs have not expired)

vukolic
2016-10-03 17:54
That is diff @jyellick

vukolic
2016-10-03 17:55
With time we can simplify replay

vukolic
2016-10-03 17:55
At the expense of introducing time

kostas
2016-10-03 17:55
If I send you TX `foo` signed with my ecert, and then TX `bar` signed with my ecert, why do you only have to keep track of `bar`? Someone may well re-introduce `foo`?

vukolic
2016-10-03 17:55
There is a logical ts with both of those

jyellick
2016-10-03 17:56
@kostas because `foo` was seqNo=7 and `bar` was seqNo=8

kostas
2016-10-03 17:57
Got it now, yup. Thanks.
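
The per-client sequence-number check being described can be sketched as follows (assuming stable, ecert-style client identities; names are illustrative, not fabric APIs):

```go
package main

import "fmt"

// replayGuard tracks the highest sequence number accepted per client,
// so the state is O(number of clients), as Marko notes for ecerts.
type replayGuard struct {
	lastSeq map[string]uint64
}

func newReplayGuard() *replayGuard {
	return &replayGuard{lastSeq: make(map[string]uint64)}
}

// accept returns true only if seq is strictly greater than anything
// previously accepted from this client, rejecting replays.
func (g *replayGuard) accept(client string, seq uint64) bool {
	if seq <= g.lastSeq[client] {
		return false // replayed or stale request
	}
	g.lastSeq[client] = seq
	return true
}

func main() {
	g := newReplayGuard()
	fmt.Println(g.accept("alice", 7)) // `foo` at seqNo=7: accepted
	fmt.Println(g.accept("alice", 8)) // `bar` at seqNo=8: accepted
	fmt.Println(g.accept("alice", 7)) // re-introduced `foo`: rejected
}
```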

jyellick
2016-10-03 17:58
@kostas So your suggestion would be we add a `Type` field to the block header which scopes the `Data` section of the block? Then what format would we use for the reconfiguration transaction?

kostas
2016-10-03 17:59
I was processing your earlier point before Marko chimed in with the eCert clarification. I see the concern.

kostas
2016-10-03 18:00
My response would be "a much simpler format", but then your response would be "but then we'll have to re-invent security", correct?

jyellick
2016-10-03 18:00
Exactly

kostas
2016-10-03 18:03
Alright so --

cca
2016-10-03 18:04
... re-joining for a minute ... you've summarized all points

cca
2016-10-03 18:04
now make sure to get that across to the fabric-crypto channel!

kostas
2016-10-03 18:04
If we agree that the reconfigs will eventually have to go through the same kind of verification checks as standard transactions, I can see why it makes sense to make the reconfig work within the current schema.

kostas
2016-10-03 18:06
The concern here is:

kostas
2016-10-03 18:07
An inexpensive way for the peers/orderers to realize they're dealing with a reconfig.

vkandy
2016-10-03 18:07
has joined #fabric-consensus-dev

kostas
2016-10-03 18:08
Then, regardless of whether reconfigs use the same schema as standard TXs, would you agree that labeling the AB block as of type "reconfig" (and having it only include this single "tx") would make filtering less expensive?

jyellick
2016-10-03 18:10
Okay, I would not agree with "easier", but in your rephrased "less expensive", yes.

jyellick
2016-10-03 18:10
However, the idea of a type scoping the contents of the block does not sit well with me

kostas
2016-10-03 18:10
Yes, that's why I edited it. "Easier" was wrong.

jyellick
2016-10-03 18:10
Especially as it is being set by the orderer. What happens if we wish to add support for yet another transaction type. Say UTXO?

kostas
2016-10-03 18:11
Nothing will change. UTXO will be treated as current standard TXs are now.

kostas
2016-10-03 18:11
Meaning that your type switch will always be:

kostas
2016-10-03 18:12
`switch txType { case "reconfig": doFoo(); default: doBar() }`
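
Fleshed out slightly, that peer-side routing might look like this in Go (the message types here are invented for illustration; the real definitions were still under discussion):

```go
package main

import "fmt"

// Hypothetical payload types for the peer-side type switch.
type txType int

const (
	txStandard txType = iota
	txReconfig
)

type envelope struct {
	Type    txType
	Payload []byte
}

// process routes a block entry: reconfig messages get special handling,
// everything else (standard TXs, future UTXO, ...) takes the default path.
func process(e envelope) string {
	switch e.Type {
	case txReconfig:
		return "reconfig"
	default:
		return "tran"
	}
}

func main() {
	fmt.Println(process(envelope{Type: txReconfig}))
	fmt.Println(process(envelope{Type: txStandard}))
}
```

Adding a new type like UTXO would just extend the default arm, which is kostas's point that nothing else changes.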

jyellick
2016-10-03 18:12
I don't believe having to check the type of every transaction is going to be especially expensive though, considering there is a signature verification taking place already

kostas
2016-10-03 18:12
(Where `doBar()` is where the UTXO processing would take place.)

jyellick
2016-10-03 18:12
Less expensive yes, meaningfully so? I'm not so sure

kostas
2016-10-03 18:13
I believe the only way to test this accurately is by running benchmarks.

kostas
2016-10-03 18:13
Anything else is cheap hypotheses.

kostas
2016-10-03 18:13
Let's narrow down a couple of our best proposals and I'll be happy to write these benchmarks.

jyellick
2016-10-03 18:15
Okay, so as I see it, there are basically 3 options:
1. Re-use the fabric transaction format (hopefully a simplified version that doesn't yet exist, but the current one will do for now) and use the proposalheader type field as the type switch
2. Create a transaction wrapper with basic security on it, with a type switch, and embed the transaction with one type, or the reconfiguration with another
3. Add the block type header
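
For the wrapper-with-a-type-switch option, the envelope could be as small as something like the following (a sketch with invented field names, mirroring what such a proto might carry; not an actual fabric message):

```go
package main

import "fmt"

// wrappedTx is a thin wrapper: a type discriminator, a bare minimum of
// security material, and opaque payload bytes. Signing code is then
// written once for all embedded message types.
type wrappedTx struct {
	Type      string // "tran", "reconfig", ...
	Creator   []byte // identity/cert of the submitter
	Nonce     []byte // replay protection
	Payload   []byte // opaque to the orderer
	Signature []byte // over Type|Creator|Nonce|Payload
}

// signedBytes shows which fields the signature would cover.
func signedBytes(t wrappedTx) []byte {
	b := []byte(t.Type)
	b = append(b, t.Creator...)
	b = append(b, t.Nonce...)
	return append(b, t.Payload...)
}

func main() {
	tx := wrappedTx{
		Type:    "reconfig",
		Creator: []byte("admin"),
		Nonce:   []byte{1},
		Payload: []byte("cfg"),
	}
	fmt.Printf("%d bytes to sign\n", len(signedBytes(tx)))
}
```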

vkandy
2016-10-03 18:16
Hello, I asked this in the #consensus channel but was advised to ask here. I am trying to understand *when* a block is created. The spec (https://github.com/hyperledger/fabric/blob/master/docs/protocol-spec.md#3473-committing-and-rolling-back-transactions) talks about `CommitTxBatch` but it's not clear at what point transactions are bundled to create a block. Also, what prevents the same node from always creating blocks?

kostas
2016-10-03 18:16
I'm not entirely clear on #2, but we can write the protobufs tomorrow to iron the details out and then I can benchmark.

jyellick
2016-10-03 18:18
@vkandy It sounds like you are talking about the existing 0.6 release, if not the answer will be different. When using the PBFT consensus algorithm, there is a leader election process which designates one of the nodes as the primary, who picks the contents of the next block. The primary sends out the contents, and after a three phase protocol, the network has come to consensus about what the contents of the next block will be. This block is now committed.

jyellick
2016-10-03 18:20
I'm not sure what you mean about a 'same node from creating blocks always'. The guarantee with PBFT (under the fault assumptions) is that all nodes which are following the protocol will all produce the same sequence of blocks (with the same ordered contents).

vkandy
2016-10-03 18:27
@jyellick thank you - that clears up some confusion I had. I was looking at this file (https://github.com/hyperledger/fabric/blob/master/consensus/executor/executor.go) Regarding the second part, I guess what I am asking is: can the same node be elected as leader always, so that a single node produces blocks forever? Also, if you could point me in the right direction - how often is a leader elected?

jyellick
2016-10-03 18:32
@vkandy You're quite welcome, and I'm happy to answer any more questions you might have. The `executor.go` file you linked to is used to coordinate execution and state transfer, and is called asynchronously from the `pbft` package. With classical PBFT, the leader is elected in round-robin fashion. Once a leader is chosen, the other network members watch the leader for incorrect behavior, and if they believe the leader is not acting appropriately, vote to move to the next leader. "Inappropriate" behavior is especially: failing to make progress, censoring requests, or otherwise not following the protocol correctly. Because the rest of the network is monitoring the primary's behavior, the primary may stay the leader indefinitely, so long as it behaves correctly. Although it is not classically defined in the protocol, there is a flag in `pbft/config.yaml` regarding periodic view change; this causes the network to switch leaders at some multiple of every `K` blocks, regardless of the leader's integrity. This will slow network throughput, but might be considered more 'fair' by some.
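
The round-robin election described above reduces to a one-line selection rule in classical PBFT (the real `pbft` package carries much more state around view changes; this shows only the rotation):

```go
package main

import "fmt"

// primary returns the leader for a given view: classical PBFT picks
// replica (view mod N), so each view change rotates leadership to the
// next replica in order.
func primary(view, n uint64) uint64 {
	return view % n
}

func main() {
	const n = 4 // 3f+1 replicas with f=1
	for view := uint64(0); view < 5; view++ {
		fmt.Println(primary(view, n))
	}
	// Views 0..4 map to replicas 0,1,2,3,0: every replica gets a turn,
	// and a well-behaved primary keeps its view (and thus leadership)
	// until voted out or until a periodic view change fires.
}
```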

vkandy
2016-10-03 18:44
ah! that makes sense. So I guess the primary remaining leader indefinitely isn't an issue in a permissioned chain, given that the rest of the nodes are monitoring the leader's actions. I was looking for a way to force each node to become a leader, or at least have an equal chance of becoming a leader. Thanks a bunch for this information. I'll browse the pbft package.

jyellick
2016-10-03 18:47
You're welcome, we are always here to help

jyellick
2016-10-03 20:19
@kostas @simon @sanchezl @tuand @jeffgarratt @garisingh @binhn @keithsmith Just finished a long chat with Keith about the bootstrapping issues. I did my best to try and summarize in https://jira.hyperledger.org/browse/FAB-359

simon
2016-10-04 11:30
why was this never discussed with the consensus squad?

jyellick
2016-10-04 13:10
@simon I'm not sure. I first heard about this last Friday. I believe it was discussed while Marko was visiting, but in a different 'breakout session' than I attended.

simon
2016-10-04 13:16
@keithsmith would have been beneficial to check with us

a.klenik
2016-10-04 13:37
has joined #fabric-consensus-dev

john.mccloskey
2016-10-04 14:04
has joined #fabric-consensus-dev

simon
2016-10-05 09:51
so i'll be on vacation until the 26th

simon
2016-10-05 09:51
i'd appreciate if somebody could take on the sbft integration

hgabor
2016-10-05 10:05
I can do it or drive it, if needed

simon
2016-10-05 10:10
go for it!

hgabor
2016-10-05 10:39
by sbft integration do we mean the merge of the 4 commits related to sbft and the management of corresponding JIRAs?

hgabor
2016-10-05 10:41
btw consensus people, please have a look: https://gerrit.hyperledger.org/r/#/c/1315/

simon
2016-10-05 10:41
yes, i mean that

hgabor
2016-10-05 10:42
okay I will do that

simon
2016-10-05 10:42
thanks!

simon
2016-10-05 10:42
hands over baton

hgabor
2016-10-05 10:42
welcome

simon
2016-10-05 10:42
there are also some things that still need to be addressed

simon
2016-10-05 10:42
there are some TODOs in the code

simon
2016-10-05 10:42
for which i will create jira issues

hgabor
2016-10-05 10:43
yeah I think we don't have to fix all of them in the current changesets

simon
2016-10-05 10:46
what we have should go in

hgabor
2016-10-05 10:46
yes

tuand
2016-10-05 13:05
jira issues reassign to Gabor @simon ? I glanced at the dashboard and didn't see gabor's name yet

stevenroose
2016-10-05 14:26
has joined #fabric-consensus-dev

frankyclu
2016-10-05 14:44
@kostas @garisingh thanks for the earlier explanation. the "panic" problem is actually pretty easy to duplicate (probably likely to happen in prod too) in high-volume environments where one or more of the nodes have network problems. I probably need to test it more to get more details, but what I believe I saw: the node with the bad network will first find itself out of sync, then the weak certs it receives are outside its high watermark, then it will initiate a view change request. However, if the network is again bombarded with txs during the view change, the problem node will panic (I am gonna see if increasing the log size will get better luck)..... anyway, a node shouldn't panic due to a network problem

frankyclu
2016-10-05 14:45
there is also another minor problem when restarting the bad node:

frankyclu
2016-10-05 14:46

frankyclu
2016-10-05 15:02
I believe it is caused by restarting the bad node too soon (before each of the other nodes has detected the lost connection). I believe the good nodes still keep the old connection handler, so they will continue to send get_peer messages, while at the same time the restarting node will try to send hello to the good node, and then get a duplicate-handler exception that leads to never-ending get_peer messages. @jeffgarratt @binhn you may have better ideas.... what I can do now is wait longer before restarting to ensure all nodes on the network have detected the lost connection

simon
2016-10-05 15:02
tuand: i think as soon as somebody starts working on an issue, they will assign it to themselves?

simon
2016-10-05 15:16
meh, iterating over maps

simon
2016-10-05 15:16
yey nondeterminism

tuand
2016-10-05 15:16
agreed simon ... all yours @hgabor

simon
2016-10-05 15:16
i just found a nondeterministic piece of code in my test system

simon
2016-10-05 15:16
-_-

jyellick
2016-10-05 18:29
For those who are interested, I've pushed a commit to Gerrit which implements a tiny little DSL via protobuf for specifying signature validation policies https://gerrit.hyperledger.org/r/#/c/1487/
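
A toy illustration of the general shape such a policy DSL might evaluate to, assuming an n-out-of combinator over identity leaves; this is not the code in the changeset, just a sketch of the idea:

```go
package main

import "fmt"

// policy is a node in a small signature-validation policy tree.
type policy interface {
	satisfied(signedBy map[string]bool) bool
}

// signedByID is a leaf: a specific identity must have signed.
type signedByID string

func (s signedByID) satisfied(signedBy map[string]bool) bool {
	return signedBy[string(s)]
}

// nOutOf requires at least N of its sub-policies to hold, which also
// composes: sub-policies can themselves be nOutOf nodes.
type nOutOf struct {
	n    int
	subs []policy
}

func (p nOutOf) satisfied(signedBy map[string]bool) bool {
	count := 0
	for _, sub := range p.subs {
		if sub.satisfied(signedBy) {
			count++
		}
	}
	return count >= p.n
}

func main() {
	// "2 of {A, B, C}" -- e.g. an ingress or endorsement policy.
	pol := nOutOf{n: 2, subs: []policy{signedByID("A"), signedByID("B"), signedByID("C")}}
	fmt.Println(pol.satisfied(map[string]bool{"A": true, "C": true}))
	fmt.Println(pol.satisfied(map[string]bool{"B": true}))
}
```

Encoding the tree as protobuf, as the changeset does, gets schema validation of the policy for free before it is "executed".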

simon
2016-10-05 21:46
if somebody could fix my `connectAll()` in `simplebft_test.go` that would be wonderful

simon
2016-10-05 21:47
it is nondeterministic and occasionally breaks a test (because there is a bug in the startup code: the request timer is not reset after `sendCheckpoint` is called in `New()`). that's a second bug that needs love @hgabor

simon
2016-10-05 21:47
ok, off to vacation

kostas
2016-10-06 01:13
@hgabor: related to your work in FAB-473, when you find some time, can you please tell me whether you can see it being useful for FAB-469? It would be nice to have something that's re-usable. https://jira.hyperledger.org/browse/FAB-469

kostas
2016-10-06 01:14
By the way, I've added two issues in JIRA for what I think are the logical next steps for the Kafka orderer:



kostas
2016-10-06 01:16
I'll wait until Monday, and then I'll also share with the mailing list along with the rest of the consensus backlog.

hgabor
2016-10-06 05:31
@kostas I will check it. @simon I will fix it

srirama_sharma
2016-10-06 05:56
has joined #fabric-consensus-dev

yacovm
2016-10-06 06:18
Does anyone know whether the blocks coming from the consensus layer are now going to be signed by every consenter (multi-signature) or each block is going to be signed only be one of them? @vukolic @simon ?

vukolic
2016-10-06 06:30
@yacovm in any solution, between f+1 and 2f+1 sigs per batch makes sense

yacovm
2016-10-06 06:31
Ok so multi sig

vukolic
2016-10-06 06:31
1 does not, as it does not guarantee much, and all does not fly due to fault tolerance

vukolic
2016-10-06 06:31
Yes 1 would fly with threshold sigs but crypto is not yet there

yacovm
2016-10-06 06:31
Ok just making sure

adc
2016-10-06 08:38
@jyellick, regarding https://gerrit.hyperledger.org/r/#/c/1487/, is it possible to have a policy that says: check the signature against all the identities. This might be useful in case one wants to use ring signature. Actually, I like a lot your approach. It looks like it can be generalized and used also in other contexts, i.e. endorsement

jyellick
2016-10-06 13:22
@adc Thanks for the feedback, I'd love to add support for ring signatures if possible. It sounds like in order to support this, instead of returning a `bool` for "signature is valid" we could return a `[]int` indicating which signatures are valid, then validate that `N` of the identities have signed?

adc
2016-10-06 13:24
@jyellick, actually it is more like this. The verification algorithm takes in input the signature, the message and all the public key in the ring group, and returns yes or no

jyellick
2016-10-06 13:25
@adc Oh, I see, I can think on how to support this, unless you have an idea off the top of your head?

adc
2016-10-06 13:26
actually, this was more a sanity check to verify that the framework is generic enough. I think it can be accommodated in multiple ways, actually

adc
2016-10-06 13:26
for instance, one can pass to the crypto helper the concatenation of all the public keys

adc
2016-10-06 13:26
so no interface needs to be changed

adc
2016-10-06 13:27
anyway, it is really cool that you can do this with protobuf :slightly_smiling_face:

jyellick
2016-10-06 13:28
Great, thanks! I thought it was a cute idea... let protobuf do the schema validation to make sure the thing is well formed, then 'execute'

adc
2016-10-06 13:29
indeed :slightly_smiling_face:


tuand
2016-10-06 14:33
can one of you guys describe where we are at ? and what other things we need to put on the table ?

tuand
2016-10-06 14:34
@jyellick ? @keithsmith ?

tuand
2016-10-06 14:34
or perhaps orderer service should go its own way and figure out just bootstrapping for orderers only ?

jyellick
2016-10-06 14:35
@tuand Totally agree, we have far too many people working in isolation on this

jyellick
2016-10-06 14:36
I'm working with @muralisr today trying to make some more concrete progress, and was also just talking with @hgabor. I'll paste what I told him here:
In brief, for bootstrapping, we should push a 'special transaction' into the genesis block which contains the configuration info. That way we can re-use the same mechanism for reconfiguration down the line. We'll need of course the PBFT identities and config, but also the client CA certs; that's what started me down the path to https://gerrit.hyperledger.org/r/#/c/1487/
So, assuming we can get a transaction format which is easy for non-fabric stuff to produce and consume, then we need to define hopefully some sort of generic config proto, and then some pbft-specific extensions to it. Or at least that's the plan of attack in my head

tuand
2016-10-06 14:39
ok, pull me when you talk to @muralisr ? maybe I can work up a flow diagram ...

tuand
2016-10-06 14:40
and I don't know where @jeffgarratt stands on this either ... I know he mentioned a few concerns on FAB-359

jyellick
2016-10-06 14:40
Yes, I've talked with @jeffgarratt a bit about this, he was actually the first one to point me to FAB-359, I'm hoping that my comments are helping to address his concerns.

jeffgarratt
2016-10-06 15:25
@jyellick @tuand @muralisr I think we should focus like a laser on bootstrap

2016-10-06 15:41
@jyellick has started a Google+ Hangout for this channel. https://hangouts.google.com/hangouts/_/avqbsligmjgkbf3ckq35ldokbqe.

jyellick
2016-10-06 15:41
Ongoing discussion of bootstrapping and endorsement policies ^above

keithsmith
2016-10-06 17:47
Regarding bootstrapping, I just updated the description of https://jira.hyperledger.org/browse/FAB-359 so pls comment

jyellick
2016-10-06 18:03
Commented, we have been using the hangout above to discuss the bootstrapping flow but broke for lunch

jyellick
2016-10-06 18:03
We'll be resuming in a bit if you'd like to join

jyellick
2016-10-06 18:10
To summarize the results thus far, the flow was envisioned as follows:
1. Entities create CA certs for the peer network
2. Entities create self-signed certs for the orderer network
3. Certs are sent to the bootstrap administrator
4. Bootstrap administrator uses a bootstrapping tool to generate the genesis block
 * Bootstrap administrator sets the ordering ingress validation policy
 * Bootstrap administrator sets the ordering egress validation policy
 * Bootstrap administrator sets the ordering opaque config (for instance specifics of PBFT f/K/L/etc.)
 * Bootstrap administrator sets the peer opaque config (VSCC policies, etc.)
5. Bootstrap administrator distributes the genesis block to the other administrators
6. After inspection and approval, each administrator installs the block at their orderer node and starts it (orderer network now functional)
7. Admins supply the genesis hash and ordering service address to the peers, which connect and receive the genesis block to bootstrap their configuration

jyellick
2016-10-06 18:11
@keithsmith ^

keithsmith
2016-10-06 18:15
OK, given the above, I think the only differences to what I have in FAB-359 are:

keithsmith
2016-10-06 18:16
1) The way in which the COP APIs are used at startup since you’re gen’ing the genesis block thru tooling rather than 1st start

keithsmith
2016-10-06 18:17
2) You’re saying that you use self-signed certs for orderers. I don’t think you have to require that, but certainly is the easiest to to start with from a test perspective

jyellick
2016-10-06 18:18
The other distinction I see is that COP contains membership services

jyellick
2016-10-06 18:19
And I don't think we need that embedded in the orderer or most peers

keithsmith
2016-10-06 18:19
COP is just a library in this case … and a CLI tool

jyellick
2016-10-06 18:19
They need to be able to validate policies, which is fine

jyellick
2016-10-06 18:19
But I'm wary of bringing in function like issuing certificates to something like the orderer which simply doesn't need that

jyellick
2016-10-06 18:19
It should be given a certificate at bootstrap

keithsmith
2016-10-06 18:21
You don’t have to run a cop server in an orderer. I think there is confusion.

cca
2016-10-06 20:48
@jyellick - i hope that not all 11 steps require manual typing, instead there should be a way to give the minimal info and some tool does the steps and installs all files!

jyellick
2016-10-06 21:26
@cca Absolutely there will be a tool which packages this nicely, but we need to know ultimately what the tool will be doing under the covers. I would say a simplified list is:
1. All parties generate keys/certs as needed
2. Admin feeds public keys and some initial config info into a bootstrapping tool, which produces the genesis block
3. Admin gives the genesis block back to the interested parties, who confirm they are happy with it
4. Orderer network starts and peers connect to bootstrap
I think (1) is unavoidable because you do not want someone else generating your private key. (2) is more effectively 'the bootstrapping', (3) disseminates, and (4) executes. Assuming some lack of trust between the admin and the components, I'm not sure it can get much simpler

jyellick
2016-10-06 21:28
And of course for POC or trusted networks with a single admin, this can be simplified still. I see no reason we could not support deploying a network with a single point of trust in a single command.

hgabor
2016-10-07 06:52
There is a bug in the sbft implementation. Any help is welcome.

vukolic
2016-10-07 10:13
@simon ^^^

vukolic
2016-10-07 10:13
@hgabor can you pls be more specific? are you posting the issue to JIRA?

hgabor
2016-10-07 10:16
@vukolic Simon is on vacation. I am trying to deal with it. Do we need a JIRA for it? if yes, I can open one or write to the existing ones. I meant this implementation: https://gerrit.hyperledger.org/r/#/c/1315/

vukolic
2016-10-07 10:21
I am only on my mobile today so gerrit does not render nicely

vukolic
2016-10-07 10:22
The code i was looking at with simon about a week ago was on his github fork

vukolic
2016-10-07 10:22
Did he integrate back to gerrit in the meantime

hgabor
2016-10-07 10:35
yes he did

kostas
2016-10-07 11:45
@hgabor: Open up an issue for it with details, and -2 the SBFT changesets. My plan moving forward is to split my time between the Kafka work and SBFT, so I could look into it.

oiakovlev
2016-10-07 12:15
Have asked this question in #membership-services but will re-ask here as well, as it is in some sense a question about consensus in the case of a `bad` mbrsvc node: Hi, question regarding the mbrsvc architecture in v1: what will happen if somebody gains control over one of the membership services instances? Theoretically it can start issuing new certificates, and new peers will join the network, which can take control over the network? This was an issue in v0.5 - mbrsvc was a single point of failure, right? But now mbrsvc is distributed, so how is this case prevented? Is there any use case description for such a scenario? Which architecture should a customer use to prevent it?

oiakovlev
2016-10-07 12:16
Also I have read https://docs.google.com/document/d/1TRYHcaT8yMn8MZlDtreqzkDcXx0WI50AV2JpAcvAM5w/edit# and the discussions in https://jira.hyperledger.org/browse/FAB-361 - maybe some other doc exists that will help me understand the behavior here?


tuand
2016-10-07 13:17
what @jyellick @muralisr @keithsmith @sanchezl @jeffgarratt @tuand spent the day discussing ^^^

tuand
2016-10-07 13:17
I'll format and put this and above comments into jira https://jira.hyperledger.org/browse/FAB-359


jyellick
2016-10-07 14:11
@oiakovlev If someone takes control over the membership services / private keys of an entity in the network, they can transact as that entity. I think this is unavoidable. However, in v1, we allow multiple roots of trust, so usually (depending on the configuration), one corrupted root of trust would not be sufficient to take over the entire network.

oiakovlev
2016-10-07 14:12
but why? If I have an entity's cert, can I spin up as many peers as I want?

oiakovlev
2016-10-07 14:13
as I mentioned, I have read some docs on future changes but guess that I still don't have the whole picture here

jyellick
2016-10-07 14:13
So, in v1 having more peers does not necessarily give any more power.

jyellick
2016-10-07 14:15
The endorsement policies, and other policies specify which certificate roots need to sign off on something.

jyellick
2016-10-07 14:15
So, controlling entity A only partially fulfills a policy that says it requires a signature from A, B, and C

oiakovlev
2016-10-07 14:17
ah, true, it is separate services now... makes more sense now...

oiakovlev
2016-10-07 14:24
thanks @jyellick! Another question which documents for security scenarios/membership service exist right now except https://docs.google.com/document/d/1TRYHcaT8yMn8MZlDtreqzkDcXx0WI50AV2JpAcvAM5w/edit# and discussions in https://jira.hyperledger.org/browse/FAB-361 and presentations from https://jira.hyperledger.org/browse/FAB-37?

jyellick
2016-10-07 14:29
Although it's more focused on bootstrapping, you might find some illuminating information in https://jira.hyperledger.org/browse/FAB-359

oiakovlev
2016-10-07 14:29
yeah it is linked to FAB-361, so read all of them

oiakovlev
2016-10-07 14:30
if I got it correctly- there are still some ongoing discussions, so some detailed technical specification can't exist right now... am I correct?

mart0nix
2016-10-07 14:31
has joined #fabric-consensus-dev

jyellick
2016-10-07 14:36
Yes, things are still evolving, but reading JIRA and slack is probably the way to get the best picture, so looks like you're on your way

oiakovlev
2016-10-07 14:36
:slightly_smiling_face: thanks again!

jyellick
2016-10-07 14:40
You're welcome, will be here if you have any other questions

phyrex
2016-10-10 11:02
has joined #fabric-consensus-dev

zemtsov
2016-10-10 13:18
has joined #fabric-consensus-dev

crazybit
2016-10-10 13:37
has joined #fabric-consensus-dev

tom.appleyard
2016-10-10 14:43
has joined #fabric-consensus-dev

tom.appleyard
2016-10-10 14:44
Quick question about PBFT - would anyone be able to explain on what grounds the validating leader is changed? (i.e. is it every block, after a certain number of TXs etc.) Following from this how is the new leader chosen? I'm told this happens through an election of some kind - what kicks this off, what decides how votes are cast?

jyellick
2016-10-10 14:46
@tom.appleyard There are a number of conditions which can cause a change in leadership, but in general they can be boiled down to 'not allowing the network to make progress'

jyellick
2016-10-10 14:47
So, for instance if the primary is refusing to order a message from another replica, or the primary is skipping sequence numbers, etc., this would be grounds for the other replicas to issue view changes, to cause leadership to change

jyellick
2016-10-10 14:47
Leadership changes in a round-robin fashion, you can compute the leader as the `view % N` where `N` is the number of replicas in the network.
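
The round-robin rule above can be sketched as a toy Go function (illustrative only, not the actual fabric code):

```go
package main

import "fmt"

// primary returns the expected leader for a given view, following the
// round-robin rule described above: view % N.
func primary(view, n uint64) uint64 {
	return view % n
}

func main() {
	const n = 4 // replicas in the network
	for view := uint64(0); view < 6; view++ {
		fmt.Printf("view %d -> leader is replica %d\n", view, primary(view, n))
	}
}
```

Each failed view change simply increments the view number, so leadership cycles through all replicas.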

jyellick
2016-10-10 14:48
Essentially, if the new leader does its job correctly, then it will stay the leader, otherwise the rest of the network will send view-changes to move on to the next

tom.appleyard
2016-10-10 14:49
ah right cool - thanks! another quick question - when a node that is out of date wants to get up to date, as I understand it, it gets deltas/blocks/a snapshot - how does it decide which to get? Can it measure its out-of-dateness?

jyellick
2016-10-10 14:53
Yes, it sounds like you are referring to the 0.5/0.6 release. In this case, there is a configuration variable you can see in `peer/core.yaml`: `statetransfer.maxdeltas`. If the peer's block height is within that variable number of blocks of the network block height, then it will attempt to transfer via state deltas. If it is further out of date than that, then it will attempt to recover via the state snapshot. Note that the number of state deltas retained by a peer (to send to others) is controlled in that same file by `ledger.state.deltaHistorySize`, and you should be sure to keep `maxdeltas < deltaHistorySize`
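
The delta-vs-snapshot decision described above can be sketched as follows (a toy Go function with a made-up name; the real logic lives in the 0.5/0.6 state transfer code):

```go
package main

import "fmt"

// chooseRecovery decides whether an out-of-date peer should catch up via
// state deltas or via a full state snapshot. maxDeltas plays the role of
// statetransfer.maxdeltas in peer/core.yaml.
func chooseRecovery(myHeight, networkHeight, maxDeltas uint64) string {
	if networkHeight-myHeight <= maxDeltas {
		return "deltas" // close enough: replay individual state deltas
	}
	return "snapshot" // too far behind: transfer a full state snapshot
}

func main() {
	fmt.Println(chooseRecovery(95, 100, 10)) // 5 blocks behind: deltas
	fmt.Println(chooseRecovery(10, 100, 10)) // 90 blocks behind: snapshot
}
```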

tom.appleyard
2016-10-10 14:55
brill - thanks! :slightly_smiling_face:

kostas
2016-10-10 15:40
@yacovm: I just watched the demo video for gossip, nice. Can you provide a few more details on the bootstrap peer? Who will maintain/own this one in a network?

kostas
2016-10-10 15:41
Is there any overlap with the bootstrap server presented in FAB-359?

kostas
2016-10-10 15:42
And should we maybe make it so that every peer that joins the network has to announce themselves via a transaction on the ledger (I believe @jeffgarratt goes one step ahead and says that this transactions needs to be endorsed), so that this bootstrap peer is no longer needed?

jyellick
2016-10-10 16:10
@kostas My impression from the video was that all peers are capable of being bootstrapping peers, but that one/some must be selected for initial config, @yacovm please correct me if I am wrong

yacovm
2016-10-10 16:11
^ exactly

yacovm
2016-10-10 16:11
you can have any number of them as your bootstrap peers

yacovm
2016-10-10 16:13
about 359- I dunno if there is overlap, generally- the gossip component doesn't have any "reference" of roles of nodes in fabric.

yacovm
2016-10-10 16:14
our discovery API simply lets you somehow "know" about all the nodes in the network and the metadata that these nodes publish, you can leverage the metadata ([]bytes) to determine the role of the specific node in the world

kostas
2016-10-10 16:30
Understood. I figured that was an option, just wasn't sure if there were other assumptions here I wasn't aware of. Thanks.

kostas
2016-10-10 16:31
And the address of this bootstrapping peer (or peers) that is to be put into the config file is acquired how?

garisingh
2016-10-10 16:32
@kostas - endorsement seems pretty meaningless (other than it's the normal course of checking to make sure that a proposal meets some type of criteria) when it comes to things like "membership" (for lack of a better term). Endorsement is not equivalent to "approval" - meaning if a peer endorses a proposal to add another peer, what does that actually mean? I would think that there still needs to be some type of out-of-band process which actually collected a bunch of signatures which would actually be part of the proposal. Endorsement would check the fact that the proposal actually had enough signatures (for example)

garisingh
2016-10-10 16:34
the same holds true for orderers as well. maybe this was obvious to people, but it seems we sometimes forget that actual approval to add anything happens out of band, unless we plan on introducing some type of interactive workflow (meaning there can be some type of intervention before automatically endorsing). I do see this mentioned in some of @jyellick's comments

garisingh
2016-10-10 16:35
(not picking on you BTW)

yacovm
2016-10-10 16:46
the address is to be acquired in any manner you want, i guess. It can be like fabric 0.5 in the core.yaml file, or any other way (dns? multicast? I don't know)

yacovm
2016-10-10 16:47
Gari - I think there is *some* overlap between endorsements and membership though - which peers can be endorsers of a chaincode

garisingh
2016-10-10 16:50
agreed - I just wanted to say that endorsement is not the same as approval (well at least by default).

muralisr
2016-10-10 16:52
@garisingh `...actual approval to add anything happens out of band …` this is specifically for things like adding a peer correct ?

muralisr
2016-10-10 16:52
ie, not a general statement for typical proposal flows

garisingh
2016-10-10 16:56
correct. but I think people sometimes get confused with endorsement and signature-based workflow / approval. That's not to say some could not implement chaincode which actually did some type of real approval (e.g. check for some entry, etc), but they are not equivalent

kostas
2016-10-10 22:52
@garisingh: Yeah, I was fuzzy on this and didn't quite get it when I heard it - hence the parenthesis and the reference to Jeff. Thanks for clarifying.

kostas
2016-10-11 08:34
@yacovm My question is a bit more practical (I think): how do I know which IP I'm going to add to the file? How do we imagine this playing out in a real-world scenario?

yacovm
2016-10-11 08:34
Kostas, I'm really not sure why you're asking this now. This is essentially the way fabric works...

kostas
2016-10-11 08:36
Well, for one - just because 0.5 made some assumptions doesn't mean they have to necessarily carry over to v1.

yacovm
2016-10-11 08:37
okay- this is a fair point

yacovm
2016-10-11 08:37
but let me ask you something then - how does discovery work in the real world?

yacovm
2016-10-11 08:37
you *always* need some bootstrapping endpoint

yacovm
2016-10-11 08:38
either it's your DNS server, or your preconfigured files

yacovm
2016-10-11 08:38
or maybe sometimes your address is known to someone else and it contacts you (i.e - ip multicast)

kostas
2016-10-11 08:39
I am asking this however because it's a genuine question. For instance for the orderers, it's a given that these guys that run the ordering network will need to call/email/fax each other before they come online, so that's how they know who's who.

yacovm
2016-10-11 08:39
depends on what type of orderers they are

yacovm
2016-10-11 08:39
if its a SOLO then its alone

yacovm
2016-10-11 08:39
if it's KAFKA, then they need to know their brothers (Zookeeper is statically configured AFAIK)

yacovm
2016-10-11 08:39
but this has nothing to do with the gossip bootstrapping

kostas
2016-10-11 08:39
The solo work is a stopgap measure.

kostas
2016-10-11 08:39
Again, I understand this.

kostas
2016-10-11 08:41
What I am simply asking: you expect then, that a peer who joins the network knows the address of another peer already and will use that as the bootstrapping peer address?

yacovm
2016-10-11 08:41
yes

yacovm
2016-10-11 08:42
it can be btw not a peer but someone else like the membership service, or the consensus itself

yacovm
2016-10-11 08:44
essentially the only thing that entity needs is to be able to answer to a certain protobuf stream object:
```
message MembershipRequest {
    AliveMessage selfInformation = 1;
    repeated string known = 2;
}
```
With:
```
message MembershipResponse {
    repeated AliveMessage alive = 1;
    repeated AliveMessage dead = 2;
}
```
So if any entity embeds inside itself a gossip component it'll work.

yacovm
2016-10-11 08:45
currently from what I know- when a peer needs to know the certificate of another peer to verify messages signed by it, it contacts the membership service via a gRPC call.

yacovm
2016-10-11 08:46
so, if security is enabled - the membership service needs to be up all the time, or else messages won't be verified in new peers (that have joined lately)

yacovm
2016-10-11 08:48
and, of course- if you have further questions feel free to ask me here or in private or in #fabric-gossip-dev

yacovm
2016-10-11 08:49
and btw it obviously can be a list of bootstrapping peers, not only 1...

bala.vellanki
2016-10-11 15:51
has joined #fabric-consensus-dev

jyellick
2016-10-11 15:54
https://hangouts.google.com/hangouts/_/nkqa6vwc6jeo3j4wsrp2shgi3qe <- Writing a bootstrapping feature file for a single chain

yacovm
2016-10-11 17:48
hey, can anyone explain to me something regarding what @kostas wrote?
```
The orderer configuration needs to be embedded within the raw ledger. This is for two reasons:
```
```
1. The orderer service needs to be able to convey initial configuration and configuration changes to the peer service so that the peer knows how to properly validate the raw ledger being returned (this may be different per orderer implementation, and may be as simple as verifying a public key, or as complicated as a PBFT f+1 out of N signatures or connections)
```
Obviously I understand that the "block validating policy" differs between types of orderers, but isn't the type of orderer in the network a static thing? If it doesn't change, why can't we have several "strategies (policies)" (like the strategy design pattern) pre-implemented in the peer, and the configuration of the peer will select which policy to use?
```
2. The orderer service must agree to an initial configuration as without a common initial configuration (and common points of time for changing the configuration) the correctness of the orderer service may not be guaranteed.
```
I don't understand how this backs up the claim. The orderer doesn't need to agree on its configuration with the peers, only with other orderers.

kostas
2016-10-11 18:09
I didn't write the above, but I can answer your questions.

kostas
2016-10-11 18:10
How would the configuration of the peer select which policy to use?

yacovm
2016-10-11 18:10
I thought that email was from you, sorry

yacovm
2016-10-11 18:11
oh, oops

yacovm
2016-10-11 18:11
the email is but the jira issue isn't :slightly_smiling_face:

yacovm
2016-10-11 18:12
let's say we have in the yaml file a string that maps to a certain struct type

yacovm
2016-10-11 18:12
and that struct type has an implementation of "what to do with a block"

yacovm
2016-10-11 18:13
(how to validate, and when)
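
The strategy idea yacovm outlines could look roughly like this in Go (all type and function names here are hypothetical, not actual fabric interfaces):

```go
package main

import "fmt"

// BlockValidationPolicy is the "strategy": each orderer type would ship
// with its own implementation, selected by a string in the yaml file.
type BlockValidationPolicy interface {
	Validate(block []byte, sigs int) error
}

// singleSigPolicy accepts a block carrying at least one valid signature.
type singleSigPolicy struct{}

func (singleSigPolicy) Validate(block []byte, sigs int) error {
	if sigs < 1 {
		return fmt.Errorf("need at least 1 signature, got %d", sigs)
	}
	return nil
}

// pbftPolicy requires f+1 matching signatures.
type pbftPolicy struct{ f int }

func (p pbftPolicy) Validate(block []byte, sigs int) error {
	if sigs < p.f+1 {
		return fmt.Errorf("need %d signatures, got %d", p.f+1, sigs)
	}
	return nil
}

// policyFromConfig maps a config string to a strategy, as proposed above.
func policyFromConfig(name string) BlockValidationPolicy {
	switch name {
	case "pbft":
		return pbftPolicy{f: 1}
	default:
		return singleSigPolicy{}
	}
}

func main() {
	p := policyFromConfig("pbft")
	fmt.Println(p.Validate(nil, 2)) // <nil>: 2 >= f+1
	fmt.Println(p.Validate(nil, 1)) // error: not enough signatures
}
```

kostas's objection below is that the policy's parameters (e.g. `f`) are not static, which is why the configuration has to travel with the ledger rather than live in a yaml file.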

kostas
2016-10-11 18:15
No, as was the case with the bootstrapping node question earlier, I'm almost never referring to the low-level how-do-we-code-it issue. That's clear. What I'm asking is: how do you decide whether you should connect to say, at least, 3 orderers instead of 2?

yacovm
2016-10-11 18:15
isn't the number of orderers static?

kostas
2016-10-11 18:15
No.

kostas
2016-10-11 18:16
While the type of orderer (i.e. consensus) may be static, the orderer network's configuration is not. It may also be the case that you, as a connected peer, cannot deduce which policy applies just by counting the number of orderers in the network, which I guess is what you imply. Consider for example an orderer network where you added X more orderers, and a higher amount of faults can be tolerated, but the orderer network doesn't switch to these thresholds right away. The peers need to receive a signal from the ordering service that specifies exactly when the new ordering rule applies.

yacovm
2016-10-11 18:16
I see, like a consensus service re-configuration

yacovm
2016-10-11 18:17
get a consensus on the new view

yacovm
2016-10-11 18:17
and then move to it, or something like that

kostas
2016-10-11 18:18
The point is that the network needs to have a concrete reference in time on when to switch to a new policy. (Given that the orderer reconfig is valid and passes all the checks.)

kostas
2016-10-11 18:18
> The orderer doesn't need to agree on its configuration with the peers, only with other orderers.

kostas
2016-10-11 18:18
So this hopefully addresses this statement as well.

yacovm
2016-10-11 18:19
I still don't understand why this needs to be written in the ledger

yacovm
2016-10-11 18:19
isn't there another way?

yacovm
2016-10-11 18:20
how about something like- the configuration (orderers endpoints) is hashed and submitted in each block

yacovm
2016-10-11 18:20
when a peer receives a block with an odd hash, it contacts the orderers and asks them "what's up?"

kostas
2016-10-11 18:21
I guess you can come up with several variations that could work, but what does this proposal bring that the original one doesn't?

kostas
2016-10-11 18:22
Insert a block that says "config/policy is now `foo`", and assume this is the policy going forward until a new such block.

kostas
2016-10-11 18:22
How is appending a hash of `foo` in every block better? (Genuine question.)
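
The config-block scheme kostas describes - a block announces the new policy, which then applies to everything that follows until the next such block - can be sketched as (toy Go, all names invented):

```go
package main

import "fmt"

// Block is a toy stand-in for a delivered block: ordinary blocks carry
// transactions, config blocks carry a new policy string.
type Block struct {
	ConfigPolicy string // non-empty only for config blocks
}

// applyStream walks the ordered block stream, recording the policy in
// force at each block. A config block is itself validated under the old
// policy; the new policy applies from the next block onward.
func applyStream(blocks []Block, initial string) []string {
	policy := initial
	inForce := make([]string, 0, len(blocks))
	for _, b := range blocks {
		inForce = append(inForce, policy)
		if b.ConfigPolicy != "" {
			policy = b.ConfigPolicy // switch for subsequent blocks
		}
	}
	return inForce
}

func main() {
	stream := []Block{{}, {ConfigPolicy: "f=3,N=10"}, {}, {}}
	fmt.Println(applyStream(stream, "f=1,N=4"))
	// the 4-node policy validates blocks up to and including the config
	// block itself; the 10-node policy applies after it
}
```

Because every block is ordered, all peers see the policy switch at the same sequence number, which is the "concrete reference in time" kostas mentions below.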

yacovm
2016-10-11 18:23
isn't it a race condition?

yacovm
2016-10-11 18:24
lets say the ordering service now went up from 4 to 10 instances

kostas
2016-10-11 18:24
How is it a race condition? Everything is ordered.

yacovm
2016-10-11 18:24
i'm a peer and I get a signed block from 2 byzantine peers

yacovm
2016-10-11 18:25
that block says, the configuration now is from 4 to 5 instances

yacovm
2016-10-11 18:25
I believe that block is valid because 2>1=f (out of 4) signed it

yacovm
2016-10-11 18:25
and I reconfigure myself to the new (false) configuration

kostas
2016-10-11 18:27
You don't establish any connections to the new orderers (and don't receive anything from them) until you get the block that says "hey we're switching to 10 nodes and this is your new `f`" which is still sent by the network of 4 instances.

yacovm
2016-10-11 18:27
oh, I see

kostas
2016-10-11 18:27
In your scenario, you imply that the new orderers can jump in and start shooting blocks right away.

kostas
2016-10-11 18:27
But that is not the case.

yacovm
2016-10-11 18:28
hmm wait

yacovm
2016-10-11 18:29
you said that *any* peer doesn't establish connections to the new orderers before *all* peers successfully received the "checkpoint" block saying "from now on, we're the new view" - doesn't that imply that byzantine peers would be able to slow down that move? (like, read from the socket really really slow)

yacovm
2016-10-11 18:30
i'm just raising a concern of course, not saying this is inherently flawed. just making a point

kostas
2016-10-11 18:30
Where did I say "all peers"?

yacovm
2016-10-11 18:30
you didn't but isn't that what is derived? or is it a majority of peers then?

kostas
2016-10-11 18:32
It's whatever the policy dictates. And the policy would probably say something along the lines of "if X out of Y certs have signed this, I will accept it".

yacovm
2016-10-11 18:35
so, how did you (in plural) decide to bootstrap then? an "admin" entity is creating the genesis block and sending it to the orderers?

kostas
2016-10-11 18:41
Correct. A bootstrap admin entity collects the relevant info (among other stuff, what's relevant for this conversation: orderer certs, orderer addresses, and consensus config), creates a genesis block that the orderers receive, inspect, and launch with (if they approve what they see in there).

yacovm
2016-10-11 18:42
got it, thanks. but what did you mean in "change sets are welcome"?

yacovm
2016-10-11 18:43
it's not coded yet?

kostas
2016-10-11 18:43
We welcome code contributions?

yacovm
2016-10-11 18:43
lol yeah I mean- I thought people are assigned to it, and all

kostas
2016-10-11 18:45
Folks are assigned to it, and in fact we just completed the writing of the feature file that (hopefully) has everyone on the same page (see the Hangout link above). Some pieces are coded already, but we're definitely not all done.

yacovm
2016-10-11 18:46
the link to the feature file is in the hangout?

kostas
2016-10-11 18:47
The Hangout is where we are (or rather: were) still chatting about the feature file. Jeff is posting the feature file now.

yacovm
2016-10-11 18:48
ok thanks

jeffgarratt
2016-10-11 18:48
@binhn @keithsmith @garisingh @kostas @tuand @jyellick @sanchezl Here is a shot at the feature file for bootstrap

jeffgarratt
2016-10-11 18:49

markparz
2016-10-11 18:57
has joined #fabric-consensus-dev

tuand
2016-10-11 19:13
just so we don't lose all comments when slack scrolls off in a few hours, can we put the feature file in a gerrit change set and comment there ? @jeffgarratt ?

jeffgarratt
2016-10-11 20:10
@tuand sure thing

jeffgarratt
2016-10-11 20:10
we can comment on Jira also

lhaskins
2016-10-11 22:28
OK, I have a question: how many invokes or how much time does it take for a peer to "catch up" with its query values after it is out of sync and it is needed in order to reach consensus? In other words, is the following behave scenario valid? (I thought it was, but vp2 isn't "catching" back up when I expect it to)

lhaskins
2016-10-11 22:31

akihikot
2016-10-12 06:35
has joined #fabric-consensus-dev

vukolic
2016-10-12 07:55
@kostas as discussed a few times before this solution to reconfiguring consenters needs to clearly spell out that assumptions on the trust/availability of an old configuration need to be maintained until the last peer transitions to the next configuration

kostas
2016-10-12 12:15
@vukolic Yes. (And IIRC, last time we discussed this, we also talked about a possible transition period that would allow us to move to a new view with a clean slate - i.e. no transactions from the old regime and the new regime.) How do you deal with the case that Yacov points out when a peer is slow on purpose and takes forever to transition on purpose?

vukolic
2016-10-12 12:16
We may not have guarantees for very slow peers

kostas
2016-10-12 12:17
So it is not "all" peers then, right?

vukolic
2016-10-12 12:18
In that case my wording above needs to be modified to hold for any correct and "reasonably fast" peer for some conservative def of reasonably fast

vukolic
2016-10-12 12:18
It is never all peers

vukolic
2016-10-12 12:18
Just all correct peers :slightly_smiling_face:

kostas
2016-10-12 12:19
Wonderful, that's what I have in mind as well. As an ordering service you give a sufficient warning for all "reasonably fast" peers to catch up.

kostas
2016-10-12 12:19
@lhaskins What is your checkpoint period `K` set to in this scenario?

kostas
2016-10-12 12:22
Even without that knowledge, I'd note that the scenario seems asymmetric, i.e. for an earlier identical case you expect `vp1` @ transaction 40+10 to have the same value as the rest of the network @ transaction 40, but for `vp2` you expect it to have the same value @ transaction 60+10, as the rest of the network @ transaction 50 (not 60). So if it were symmetric, you should be looking for a response of `200` on value `a`.

kostas
2016-10-12 12:23
But let's start with the checkpoint period value. Then we can use the algorithm to figure out exactly what the expected value should be.

lhaskins
2016-10-12 14:30
K=2 in this scenario. I expected that vp1 would catch up to vp0 and vp3 when vp2 was stopped. Instead I receive `200` from vp0 and vp3 and `210` from vp1 on line 71 of executing this scenario.

kostas
2016-10-12 17:29
I'm working on this with @lhaskins. With these parameters, the expectations from the BDD test need to be adjusted, we'll post an update.

garisingh
2016-10-12 17:37
I must say, that BDD test is one convoluted way to show that someday / some way peers will catch up :wink:

kostas
2016-10-12 17:42
Correct, the test may well end at line 59 (with a different expectation) to show that. I trust that the team has its reasons for doing a second pass during lines 60-90.

jyellick
2016-10-12 18:16
I have been working with @hgabor on debugging some sbft problems. The root cause of one is an interesting one that perhaps we can get broader comments on. The bug is arising in a test where the primary crashes and restarts before it has received a checkpoint certificate for seqNo=3. When it restarts, the network connections establish, and the backup replicas report that their last checkpoint cert was seqNo=2 (which is correct at the time), however, in flight on the wire are checkpoint messages for seqNo=3, which the backups then receive, and complete their execution for seqNo=3. Now the network is in a state where the primary has only executed to 2, and the backups have all executed to 3, and despite the Hello, the primary does not know it is behind.

jzhang
2016-10-12 18:17
@jyellick getting the following error from the SOLO orderer after sending a broadcast: Error: {"created":"@1476296174.715561000","description":"EOF","file":"../src/core/lib/iomgr/tcp_posix.c","file_line":235,"grpc_status":14}

jzhang
2016-10-12 18:17
any idea what might be causing it?

jyellick
2016-10-12 18:18
Yes, I've spoken with @anya about this

jyellick
2016-10-12 18:18
This is happening when the sdk is hanging up on the gRPC connection, but the orderer is trying to read a new message from that client stream

jyellick
2016-10-12 18:20
To clarify, is this error making it back to the client, or is it simply being emitted in the orderer log?

jzhang
2016-10-12 18:20
the client is being notified via the “error” event

jyellick
2016-10-12 18:23
So an error of "EOF" just means there was a connection hangup, is this a problem?

jyellick
2016-10-12 18:23
(A hangup in response to the hangup sent to the orderer)

jzhang
2016-10-12 18:24
ok, i see. likely not a problem.

vukolic
2016-10-12 18:24
@jyellick what's a higher level problem here

vukolic
2016-10-12 18:24
primary should be changed in the worst case

jyellick
2016-10-12 18:24
[And, as a side note, if you are running solo, I'd recommend pulling from https://gerrit.hyperledger.org/r/#/c/1479/ as it has a bug fix in it around empty transactions (also thanks to @anya for pointing it out to me)]

jyellick
2016-10-12 18:26
@vukolic The thing that made us turn to look at this, was that a test was failing non-deterministically. This occurred because sometimes the in flight messages were delivered before the hello (passes) and sometimes after (fails). If we are designing a Hello mechanism in order to allow a node rejoining the network to properly catch up, if it fails whenever there is traffic on the network (ie, almost always) then this seems like a problem.

vukolic
2016-10-12 18:27
so can you shed some light on "hello" mechanism?

jyellick
2016-10-12 18:28
This was @simon's invention, but essentially, whenever a new connection is made between two replicas, they exchange the last (weak? strong? I need to check) checkpoint they have. In this way, the joining replica can immediately know if it needs to state transfer, or if it can execute from where it is currently at.

vukolic
2016-10-12 18:28
(BTW I consider what you outlined so far - as nothing to be fixed :slightly_smiling_face: )

jyellick
2016-10-12 18:29
(Yes, I realize, having f failed nodes is not against protocol)

vukolic
2016-10-12 18:29
I told simon this is only optimistic

vukolic
2016-10-12 18:29
and suggested a different way of catching up

vukolic
2016-10-12 18:29
which is, in fact, rather straightforward

vukolic
2016-10-12 18:29
a replica could do that hello on restart - but this is only optimistic

vukolic
2016-10-12 18:30
to try to catch up without incurring much traffic it should simply adopt the sequence number and the view number when it gets a weak checkpoint cert that is ahead of its own time

vukolic
2016-10-12 18:30
there is no state transfer here

vukolic
2016-10-12 18:31
as there is no state to transfer

vukolic
2016-10-12 18:31
and the log replication (to serve the clients) can be done lazily

vukolic
2016-10-12 18:31
now

vukolic
2016-10-12 18:31
this is when we do not have consenter reconfig - with consenter reconfig things do get more involved

vukolic
2016-10-12 18:32
so example

vukolic
2016-10-12 18:32
I am in view number 5, seqno 9

vukolic
2016-10-12 18:32
I hear about a weak checkpoint cert for view number 7, seqno=11

vukolic
2016-10-12 18:32
I immediately go there

vukolic
2016-10-12 18:33
you may ask the question of how big the buffer for future checkpoint msgs should be

jyellick
2016-10-12 18:34
FYI, I do think we're going to need 'consenter reconfig' to a limited extent out of the box. But it is the easy sort of reconfig, namely changing the membership of what certs are allowed to inject traffic (ie, peer CA membership) . I don't think this should happen often, and I think we can include a "the config changed last at seqNo=X" in the hello so that the replica knows whether it actually does need to state transfer before resuming, or if it can be lazy as you indicated.

vukolic
2016-10-12 18:34
and this is where we can have some sort of watermark but we can discuss this later on

vukolic
2016-10-12 18:35
so you see - this reconfig info is the only state of our service except view number seqno and prevhash

jyellick
2016-10-12 18:35
Right, in order to support unordered-ness across streams, we will still need some limited watermarking, but not as sophisticated as true pbft.

vukolic
2016-10-12 18:35
exactly

vukolic
2016-10-12 18:35
so what I strongly argue for

vukolic
2016-10-12 18:36
in order for simple bft to be really simple

vukolic
2016-10-12 18:36
it should leverage the fact that the state is lightweight

vukolic
2016-10-12 18:36
so once we hear (by eavesdropping) that a checkpoint weak cert says: this is the state (prevhash, seqno, viewno)

vukolic
2016-10-12 18:37
we just adopt it

vukolic
2016-10-12 18:37
modulo reconfig - you can start processing the very next preprepare

vukolic
2016-10-12 18:38
there will be of course some reordering fun here - but the point is it can never be worse than saying "stop, I am going to transfer the RL before I actually start ordering next requests"

vukolic
2016-10-12 18:38
no need to do that

vukolic
2016-10-12 18:38
this should be done lazily

jyellick
2016-10-12 18:39
Right, exactly, so long as the state is sufficiently light (and changes sufficiently infrequently) it becomes a non-issue in almost all cases

vukolic
2016-10-12 18:39
this does not mean that you cannot pull here and there this info

vukolic
2016-10-12 18:39
but doing this too often would just drown the network and in the presence of traffic a replica will just never catch up

vukolic
2016-10-12 18:39
with a pull based method

vukolic
2016-10-12 18:43
@hgabor ^^^ gabor pls see if this makes sense to you?


hgabor
2016-10-12 19:55
@vukolic sorry, I am on mobile and not such an expert on the protocol as you. :) so not sure I totally get it. Do you mean using a hello on restart and taking weak checkpoints?

vukolic
2016-10-12 19:56
ok this is a TL;DR

vukolic
2016-10-12 19:57
- the hello mechanism is a nice to have (maybe) - but it cannot solve the problem of catching up

vukolic
2016-10-12 19:57
- catching up by pull (hello-like) in cases with a lot of load will incur traffic and cannot be guaranteed to make the replica catch up

vukolic
2016-10-12 19:58
- catching up should be done by eavesdropping on a weak checkpoint cert and lazily replicating the raw ledger hole afterwards

vukolic
2016-10-12 19:58
------
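
The TL;DR above can be sketched as a toy adoption rule in Go (all names hypothetical; "weak cert" here means f+1 matching signers, so at least one correct replica vouches for it):

```go
package main

import "fmt"

// Checkpoint is a toy stand-in for a weak checkpoint certificate.
type Checkpoint struct {
	View, SeqNo uint64
	Signers     int
}

// Replica holds the lightweight consensus state vukolic refers to
// (view number and sequence number; prevhash omitted for brevity).
type Replica struct {
	View, SeqNo uint64
	F           int
}

// maybeAdopt implements the eavesdropping catch-up: on seeing a weak
// cert (>= f+1 signers) ahead of our own position, adopt its view and
// seqno immediately; the raw ledger hole is filled lazily afterwards.
func (r *Replica) maybeAdopt(c Checkpoint) bool {
	if c.Signers < r.F+1 {
		return false // not a weak cert: could be entirely byzantine
	}
	if c.View < r.View || (c.View == r.View && c.SeqNo <= r.SeqNo) {
		return false // not ahead of us
	}
	r.View, r.SeqNo = c.View, c.SeqNo
	return true
}

func main() {
	// vukolic's example: at view 5, seqno 9; cert arrives for view 7, seqno 11
	r := Replica{View: 5, SeqNo: 9, F: 1}
	adopted := r.maybeAdopt(Checkpoint{View: 7, SeqNo: 11, Signers: 2})
	fmt.Println(adopted, r.View, r.SeqNo)
}
```

Note this adopts only the lightweight state; actually serving the skipped blocks to clients still requires replicating them in the background.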

hgabor
2016-10-12 19:59
Now, there is no functionality in sbft for such replication so I will have to implement one, I guess

vukolic
2016-10-12 20:00
who implements it is another matter, but we are just trying to get on the same page

vukolic
2016-10-12 20:01
on implementation pls sync with @jyellick but let me know what's your decision there as I'd like to be following that more closely

hgabor
2016-10-12 20:03
Will this eavesdropping thing always work? Can't there be such a situation where we see no weak certs? Just thinking

vukolic
2016-10-12 20:03
if you are cut off from the network (partitioned replica) it might temporarily not work

vukolic
2016-10-12 20:04
but then a replica is rightfully behind

vukolic
2016-10-12 20:05
I will try to write this eavesdropping more precisely

vukolic
2016-10-12 20:05
so we are on the same page

vukolic
2016-10-12 20:05
(pseudocode)

vukolic
2016-10-12 20:06
I need to look at the code to see how sbft currently increments seqnos

vukolic
2016-10-12 20:06
let me try to address that tmw

hgabor
2016-10-12 20:09
Yes we are :relaxed: btw thanks for the help

vukolic
2016-10-13 08:14
@vita @yacovm @hgabor @jyellick I was discussing with @hgabor how we get to the following feature

vukolic
2016-10-13 08:15
if the network is synchronous and there are no further updates - then the state is eventually the same at all correct consenters

vukolic
2016-10-13 08:15
the question is how do we get there in a simple and efficient way

yacovm
2016-10-13 08:15
consenters?

yacovm
2016-10-13 08:15
or peers?

vukolic
2016-10-13 08:16
consenters

vukolic
2016-10-13 08:16
state = raw ledger height

yacovm
2016-10-13 08:16
but, isn't that dependent on the type of consenter?

vukolic
2016-10-13 08:16
what's a type of the consenter?

yacovm
2016-10-13 08:17
for example- is it kafka, or is it SOLO, or pbft or whatever?

vukolic
2016-10-13 08:17
ah

vukolic
2016-10-13 08:17
ok

vukolic
2016-10-13 08:17
we are discussing a variant of pbft

yacovm
2016-10-13 08:17
sbft?

vukolic
2016-10-13 08:17
so we are in the bft world (in this simpleBFT which tends to be simplified PBFT that we are developing)

vukolic
2016-10-13 08:17
yes sBFT

yacovm
2016-10-13 08:18
you're discussing a scenario in which p<f nodes are falling behind?

vukolic
2016-10-13 08:18
p \le f - yes

yacovm
2016-10-13 08:18
\leq, yeah

vukolic
2016-10-13 08:18
ok, so

vukolic
2016-10-13 08:19
the idea is we do this in the simplest way possible, but not simpler

vukolic
2016-10-13 08:20
so one idea is to have consenters periodically say hello to other consenters and ask them about their latest provable raw ledger batch height

vukolic
2016-10-13 08:20
proof comes from the signatures that we now have in sBFT so just assume we have it

vukolic
2016-10-13 08:21
the question is - and this goes for you gossip folks out there @vita @yacovm @mandler and others

vukolic
2016-10-13 08:21
1) shall this be pull based or push based and why?

vukolic
2016-10-13 08:21
2) to how many other consenters should a consenter do such pull/push

yacovm
2016-10-13 08:22
I don't understand why this is needed- let's assume a consenter peer has failed and came up again and is D blocks behind and it re-connects to all the rest of the nodes. It'll get a transaction/block/whatever, right? Won't it see that the sequence number on that block is much higher than its own and will figure it out by itself?

vukolic
2016-10-13 08:23
so if you get partitioned (not crashed) and then get reconnected again - your last seqno was 5 but others are at seqno 100

vukolic
2016-10-13 08:23
there are no further requests

vukolic
2016-10-13 08:23
how do you get to seqno=100

vukolic
2016-10-13 08:23
is the problem

yacovm
2016-10-13 08:24
i see

vukolic
2016-10-13 08:24
that needs to be solved in a very lightweight manner :slightly_smiling_face:

vukolic
2016-10-13 08:24
so - my take is - do a periodic pull-based hello to anywhere between 1 and log(N) consenters

yacovm
2016-10-13 08:24
well I personally think that the best thing to employ here is a combination of what you suggested and what I talked about- if X time has passed and no new requests, ask all peers for their height

vukolic
2016-10-13 08:24
where consenters are chosen randomly

yacovm
2016-10-13 08:24
Why log(N)? we're talking about consenters

yacovm
2016-10-13 08:25
they're not many

vukolic
2016-10-13 08:25
probably 1 is just good enough

vukolic
2016-10-13 08:25
(well we will get there :)

vukolic
2016-10-13 08:25
but 1 is ok to start with

vukolic
2016-10-13 08:25
but it should be random

vukolic
2016-10-13 08:25
so periodic pull from one random consenter

vukolic
2016-10-13 08:25
or periodic push to - how many consenters?

yacovm
2016-10-13 08:26
I think it *should* be pull from a consenter that detected a period of inactivity

yacovm
2016-10-13 08:26
But I don't understand why you need to ask log(N). What scale are you talking about here?

yacovm
2016-10-13 08:27
I thought RSM doesn't scale well to lots of nodes

vukolic
2016-10-13 08:27
eventually, say later on in 2017, I want us to run this sBFT with 100 consenters

vukolic
2016-10-13 08:27
well - define "doesn't scale" :slightly_smiling_face:

mandler
2016-10-13 08:27
In such a case I'd go for periodic (or event-based) pull from a random set of neighbors (log(N) seems to be a reasonable choice)

vukolic
2016-10-13 08:28
ok, so we have consensus on pull

vukolic
2016-10-13 08:28
for small networks, pull from a random single consenter is ok

vukolic
2016-10-13 08:28
for larger we may want to pull from more, eventually?

yacovm
2016-10-13 08:28
yeah of course it should be pull, think of it - it's much more efficient because you know you are inactive, vs. "all other nodes trying to see whether *everyone* is ok"

vukolic
2016-10-13 08:29
yes I thought so - just a) sanity checking, b) informing you guys of our discussions

yacovm
2016-10-13 08:29
does sBFT work in decent performance with 100 nodes?

vukolic
2016-10-13 08:29
it does not work yet period

yacovm
2016-10-13 08:30
Marko, I think that when we get to 100 nodes of *consenters* we'll have a much bigger problem with the scale of the peers :wink:

mandler
2016-10-13 08:30
I'd go for the general case of log(N) from the beginning, to make it ready for future scalability.

yacovm
2016-10-13 08:30
I think it's better to have something custom like Max(Log(N) , 10)

vukolic
2016-10-13 08:30
but some other protocols I know of work with throughputs of, say, 500 bitcoin-like tx per second with 100 nodes over geographically distant locations with 100Mbps networking, where one could throw in more bandwidth and get linear scalability

vukolic
2016-10-13 08:31
but stability will be the challenge

vukolic
2016-10-13 08:31
anyway

vukolic
2016-10-13 08:31
there are strong reasons to suspect that decent performance is achievable with 100 nodes

vukolic
2016-10-13 08:32
more to come on that soon

yacovm
2016-10-13 08:32
I'm not much of an expert but I've never heard of like- ZooKeeper clusters with 100 nodes

yacovm
2016-10-13 08:32
So what's the trick with sBFT?

yacovm
2016-10-13 08:33
I have déjà vu now, I think I asked you this when you were visiting Haifa

vukolic
2016-10-13 08:34
we are conducting some experiments incl ZooKeeper

vukolic
2016-10-13 08:35
we should have shareable results soon

yacovm
2016-10-13 08:35
cool that'd be interesting to see

vukolic
2016-10-13 08:35
we can go even much better than that in common case without issues we discuss here - depending on how much latency we want to trade in

vukolic
2016-10-13 08:36
as Chain and Ring patterns have nice features such as

vukolic
2016-10-13 08:36
- throughput is the best it can be (in a network with homogenous bandwidth)

vukolic
2016-10-13 08:36
- nobody falls back behind by definition

vukolic
2016-10-13 08:36
the only issue is latency

vukolic
2016-10-13 08:37
but that may not be a huge issue - depending on the use case

vukolic
2016-10-13 08:37
but this is now for #performance-benchmark

yacovm
2016-10-13 08:39
what is the benefit of having 100 consensus nodes, besides high availability and increasing `f`?

vukolic
2016-10-13 08:40
I'd say psychology and "fairness"

vukolic
2016-10-13 08:41
"every" participant in the blockchain network gets to have a piece of control at the heart of the system

yacovm
2016-10-13 08:43
hmm ok. Although isn't that a bit dangerous? as you introduce new participants, you also increase the chance of the system being stuck/compromised due to participants being offline/byzantine

c0rwin
2016-10-13 08:48
I’m just wondering what should be a write throughput of ZK cluster w/ 100 nodes? /cc @vukolic @yacovm

vukolic
2016-10-13 08:50
@c0rwin I expect we will have a paper in 10 days on this. Just a bit of patience pls - I will be posting the pointers here and #performance-benchmark once this is ready

c0rwin
2016-10-13 08:53
@vukolic looking forward for it :slightly_smiling_face:

zanejia
2016-10-13 13:06
has joined #fabric-consensus-dev

vita
2016-10-13 15:04
@vukolic What state is currently kept in the consenters, and when do we prune the transactions? Is this something that was agreed on?

hgabor
2016-10-13 15:08
guys, with @vukolic and @jyellick (thanks for their great help) I am still looking into an unidentified SBFT bug. it started with https://jira.hyperledger.org/browse/FAB-624 but we will also have another one for it as there seem to be multiple missing parts and bugs.

hgabor
2016-10-13 15:09
In a specific test case, the primary node is restarted and it does not catch up to the others. We implemented a simple pull-based update protocol for this, but using that it receives one additional batch that is not needed and that the others don't have (a duplicate).

sanchezl
2016-10-13 17:53
I will be taking a look at TLS w/ pinned self signed certificates over the next few days (see https://jira.hyperledger.org/browse/FAB-708)

jyellick
2016-10-13 17:57
For those who are interested, I've pushed out a first pass at a policy manager (which leverages the signature validation dsl) https://gerrit.hyperledger.org/r/#/c/1721/

chanderg
2016-10-14 06:02
has joined #fabric-consensus-dev

matanyahu
2016-10-14 07:21
Can someone confirm if SIEVE consensus is still on the agenda? Many presentations and documents quote it as being one of 3 pluggable consensus protocols available in Fabric but I cannot find any further details on it.

cca
2016-10-14 07:46
SIEVE is not a proper consensus protocol but a filter for preventing non-deterministic chaincode to disrupt the operation of the system. Sieve exists in prototype stage in V0.5/V0.6. The V1.0 architecture does not rely on it, as it provides the same filtering function in a different way.

benjamin
2016-10-14 07:52
has joined #fabric-consensus-dev

hgabor
2016-10-14 10:36
one big step forward, it seems that the bug is solved: 1) some small modifications in the bootup process of the starting (restarting) node (1-2 missing lines setting the state properly) 2) new view timer instead of a timer producing view changes (if anyone is interested: see the last comment on https://jira.hyperledger.org/browse/FAB-624) 3) added a pull-based sync protocol (parts of it). a pull is sent to a random node, which sends back its last batch

hgabor
2016-10-14 10:38
problem: the pull protocol was meant to be periodic, i.e. a node has to periodically send a pull to one of its random neighbours. now only a 'one shot' version is implemented (after node start). this solves the problem I had (not just the test case but the root cause, a specific situation) and does no harm to other test cases. BUT if the protocol were periodic, the test would never terminate.

hgabor
2016-10-14 10:40
reason: the (sbft) tests use a queue for message passing, and if it is empty then the system terminates. but in case of a periodic pull it will never be empty, as a new pull will always be queued.

hgabor
2016-10-14 11:43

vukolic
2016-10-14 13:23
@vita consenter state is basically pbft protocol state - in that sense it is well understood.

vukolic
2016-10-14 13:23
this includes sequence number, view number, p set q set (optionally)

vukolic
2016-10-14 13:23
what we add, specific to fabric, is previous batch hash

vukolic
2016-10-14 13:23
that's about it

vukolic
2016-10-14 13:25
there is other auxiliary state (like list of pending requests)

n.ohagan
2016-10-14 13:55
has joined #fabric-consensus-dev

tuand
2016-10-14 14:46
@tuand uploaded a file: https://hyperledgerproject.slack.com/files/tuand/F2PFCBHMM/draft_genesis_block.txt and commented: @jyellick @jeffgarratt ... working on the tool that's creating the genesis block. thinking of using JSON file for input to tool ... what fields do we need for the modification policies ?

jeffgarratt
2016-10-14 14:47
I think that is a form of the Policy struct defined in ab.proto

tuand
2016-10-14 14:47
... and any other fields that I'm missing ? I'm following what we have in FAB-359 and FAB-665

jeffgarratt
2016-10-14 14:47
@jyellick may be able to provide a more specific answer

jyellick
2016-10-14 14:51
@tuand Maybe simpler is better for the tool, but think about whether we want to allow different orderer or peer entity policies

jyellick
2016-10-14 14:51
For instance, what if the orderer service is not BFT and we want to support an orderer CA

jyellick
2016-10-14 14:51
Or, what if we want 2 signatures from peer entities for transaction ingress

jyellick
2016-10-14 14:52
Right now I'm working the other end of this problem, creating the genesis block statically with the necessary embedded policy. There is a changeset out there, but shortly after pushing it, I decided I hated it, so am really reworking both the messages and the implementation, so it may be of limited value to reference now

tuand
2016-10-14 14:58
so i'm working my way through the flow like this 1.admin gathers certs/policies/addresses:ports

tuand
2016-10-14 14:58
2. admin creates this json file and inputs to tool

tuand
2016-10-14 14:59
3.tool munges on input json file and creates a serialized genesis block to disk

tuand
2016-10-14 15:00
4. orderer starts, reads, unmarshals and writes to the system ledger

tuand
2016-10-14 15:01
i think you're working on 4. ? or parts of 3. as well ?

tuand
2016-10-14 15:02
at this point, i'm trying to see if i can describe the policy in json so that i can map to protobuf ... i'll go look at ab again

jeffgarratt
2016-10-14 15:08
@jyellick @tuand fyi I am working on the bootstrap BDD

jeffgarratt
2016-10-14 15:08
I will ping as soon as I get to that point

jeffgarratt
2016-10-14 15:08
ie the gen block

jyellick
2016-10-14 15:09
@tuand You can generate JSON from protobuf, FYI, and unmarshal JSON into protobuf objects

jyellick
2016-10-14 15:10
With respect to 4, I am handling the bit where the startup consumes a genesis block, and creating a genesis block statically in code

jyellick
2016-10-14 15:10
The piece I am working on at this instant, is taking a config transaction, validating that it matches all of the existing policies, then generating a new configuration with new policies

jyellick
2016-10-14 15:11
In order to generate the genesis block, you will ultimately need the proto I am working on for the configuration, but it is not ready yet

tuand
2016-10-14 15:12
jason, what's keeping you :smile: good deal. I'm going to play around a bit more with json <-> protobuf

n.ohagan
2016-10-14 15:25
@n.ohagan has left the channel

jyellick
2016-10-15 03:39
@hgabor From earlier you said: > problem: the pull protocol was meant to be periodical, node has to periodically send a pull to one of its random neighbours. now only a 'one shot' version is implemented (after node start). this solves the problem (not just the testcase but the root cause, a specific situation) I had and does no harm to other test cases. BUT if the protocol would be periodical, the test would never terminate. I have been thinking about this. Because we have a stateful network protocol, we can tell when a connection is established/ended. All online nodes have come to an agreement on the world state, would it be safe to stop this timer? It could always be restarted once someone joins. I dislike the idea of the network operating in a 'special' mode for tests (like having a limited number of times this timer can pop, or disabling it for some tests), so a solution like that would be preferable to me.

mcampora
2016-10-16 06:36
has joined #fabric-consensus-dev

hgabor
2016-10-16 16:05
> All online nodes have come to an agreement on the world state, would it be safe to stop this timer?

hgabor
2016-10-16 16:06
sounds good, but from an implementation point of view I am not sure (YET) how we would do it

matanyahu
2016-10-16 20:15
@cca : thanks for your answer on SIEVE

conghonglei
2016-10-17 01:29
has joined #fabric-consensus-dev

hgabor
2016-10-17 08:38
guys, today we are having some long internal meetings but I will try to join the scrum if I can. this week, I will have to work on some internal tasks but will try to have a look at SBFT bug fix and take care of the comments there (https://gerrit.hyperledger.org/r/#/c/1737/). @jyellick please see my answer for you from Oct 15th :slightly_smiling_face:

stchrysa
2016-10-17 11:13
has joined #fabric-consensus-dev

jamie.steiner
2016-10-17 12:37
Hi I have a question about the interaction between chaincode and transactions that are processed when interacting with chaincode: A) I understand that consensus is applied to new chaincode that is being added - if i add new chaincode, and then interact with it, say to add a new transaction type that is handled by that new chaincode, does the transaction itself get validated according to PBFT? B) on what basis do validating nodes decide if new chaincode is valid/safe?

tuand
2016-10-17 14:03
anyone able to join the hangout ?

jyellick
2016-10-17 14:03
@jamie.steiner I think that @muralisr is a better person to answer this, but I will do my best A) In v0.5/v0.6 a chaincode deployment goes through PBFT consensus just like an invocation. In the v1 architecture, chaincode deployment is managed through the lifecycle chaincode; someone who wishes to deploy a chaincode follows the normal invocation path of sending it to the desired endorsers for endorsement, then sends it through ordering, and once the transaction makes it onto the chain, the chaincode is finally deployed. B) In general, validity/safety is assured by virtue of the deployer being trusted with deployment privileges.

jyellick
2016-10-17 14:03
@tuand post it here?


tuand
2016-10-17 14:04

tuand
2016-10-17 14:11
@sanchezl @kostas add your 1-liner scrum summary here

sanchezl
2016-10-17 14:26
I will continue looking into FAB-708 today.

hgabor
2016-10-17 15:06
https://hyperledgerproject.slack.com/archives/fabric-consensus-dev/p1476502749002154 @jyellick I think I misunderstood your comment. how does a node know that "All online nodes have come to an agreement on the world state" so that it can stop its 'periodic pull sync protocol timer'?

jyellick
2016-10-17 15:06
Well, presumably, all nodes report the same sequence number and associated hashes?

jyellick
2016-10-17 15:07
(Note, this may never be the case under byzantine conditions, but usually would be the end state of our tests)

hgabor
2016-10-17 15:23
So e.g. I am node i, and all the other nodes seem to have the same last batch when I pull. That's why I decide to stop my timer. Is that what you have in mind?

jyellick
2016-10-17 15:29
Yes, I think that sounds right, that it should stop when you have no outstanding requests, and everyone is reporting the same last batch on the pull

hgabor
2016-10-17 15:39
Btw pull synchronization uses random nodes. I am not sure we can suppose that everyone has the same batches

weeds
2016-10-17 15:52
has joined #fabric-consensus-dev

jyellick
2016-10-17 15:55
@tuand Please see https://gerrit.hyperledger.org/r/#/c/1817/ for static genesis block generation

jyellick
2016-10-17 15:55
@hgabor Could you not track the last pulled state from each node in the network?

hgabor
2016-10-17 15:56
I could, but that would be a lot of messaging

jyellick
2016-10-17 15:56
It's not that there is an urgency for this timer to stop though? Just that we would like for it to eventually terminate in our tests?

jyellick
2016-10-17 15:57
Maybe this is the wrong approach, maybe we should do a special deviation for the tests

jyellick
2016-10-17 15:57
We obviously do not want to complicate the real code path unnecessarily just to simplify our tests

hgabor
2016-10-17 17:03
Yes, but then what to do in the tests? That is what I said, that the test's System implementation could terminate after it thinks it is only receiving pull and hello messages

ruslan.ardashev
2016-10-17 18:18
has joined #fabric-consensus-dev

hgabor
2016-10-18 14:21
@jyellick ?

jyellick
2016-10-18 14:21
@hgabor Am here.


jyellick
2016-10-18 14:25
Yes, I've been thinking on that, trying to come up with a good solution. Especially how to differentiate the case where "Now there are only pulls scheduled, but one of them will cause more work to happen" from "Now there are only pulls scheduled and nothing will happen"

jyellick
2016-10-18 14:31
The most elegant way I can come up with is essentially to set a flag for testing, which tracks the last pull from every replica, and, if all the pulls match and the pull timer fires again, to stop the pull timer. I don't like it, but I'm struggling to come up with a better way to handle it

hgabor
2016-10-18 14:37
Yeah, but that way we would need to change the sbft implementation, as I see it

jyellick
2016-10-18 14:38
Yes, I'm certainly open to other ideas, not wild about that solution, just the cleanest that comes to mind

hgabor
2016-10-18 14:38
I meant changing it and hacking some logic into it that is only used for testing

hgabor
2016-10-18 14:40
My idea was "checking if there are only pulls" but as you said the case "Now there are only pulls scheduled, but one of them will cause more work to happen" will break it

jyellick
2016-10-18 14:52
Right

kostas
2016-10-18 15:59
So, I'm working on adding support for channels on the Kafka orderer. For reference, you have Alice, Bob interacting with two different Kafka orderers/shims, and the Kafka cluster / ZK ensemble standing behind that.

kostas
2016-10-18 15:59
(Client - > Shim -> Kafka Cluster | ZK Ensemble)

kostas
2016-10-18 16:03
What you want to do is have Alice send a transaction that says "I want a channel that only Bob and I can transact on", and in the end get a partition that only the two of them can read from/write to. (We of course assume that all shims play nicely, and conform to that partition's ACL. No byzantine faults here.)

kostas
2016-10-18 16:04
I looked at the relevant KIPs, etc. and I think my options come down to these.

kostas
2016-10-18 16:05
1. The ACL is maintained by the shim. You need some custom form of consensus among the shims to establish order. (Congratulations, you've added yet another headache.)

kostas
2016-10-18 16:09
2. The ACL is handled by Kafka, which in turn posts it to ZK. This is the native way that Kafka does ACLs but it presents a few issues. One, there's no API for this yet, as the underlying protocol is still at KIP/RFC level. We would get this by literally having the shim call the authorizer CLI. Two, even if that solves the ACL issue for channels (since Kafka can handle it natively), there will always be ordering-related metadata that we want to use at the shim level that is not already covered by Kafka, whether it's at KIP-level or not.

kostas
2016-10-18 16:11
3. Taking a cue from the above, have the shims interact _directly_ with the ZK ensemble. (This is essentially config info that you wish to persist in a distributed manner, think etcd.) This means we don't use Kafka's ACL feature, and instead enforce ACL on the shim level.

kostas
2016-10-18 16:13
4. Have the shims maintain the ACL (and all other related config metadata), but don't go with ZK, etcd, or any custom consensus mechanism between them. Instead, use a special Kafka topic for such kind of config metadata. The shims write to a special Kafka topic/partition, and apply the ACL once they've read it back from the partition.

kostas
2016-10-18 16:14
I like Option 4 more. (3 is nice as well, but probably adds more overhead.)

yacovm
2016-10-18 16:35
I don't know how Kafka works, but I have some experience with ZK and I think that option 3 is cool because it gives you a notification to Alice (Bob) when Bob (Alice) has selected a channel for them (in case they come up at the same time) - you can register everyone under a shared path and set a watcher, or maybe decide on some other method. But now that I think of it - why did you think only of Alice and Bob? is there something special about pairs or is this just an example for any K? because, if it's only for pairs, you can simply always decide that for each A,B the channel will be something like A < B ? AB : BA or something like that

jyellick
2016-10-18 17:17
So, to offer dissenting opinion, I'll postulate that it is (1) that is the best answer. And, it can be accomplished with no _additional_ consensus.

kostas
2016-10-18 17:18
@yacovm Alice and Bob is just a simplification. It can and should work with multiple participants.

yacovm
2016-10-18 17:18
So if the set is A1, A2, A3, ... Ak why not arrange them lexically and define the shared channel to be the concatenation of the A_i's?

kostas
2016-10-18 17:20
The problem is not the naming of the channel. Imagine you wish to add or remove participants from it. These changes need to be ordered.

jyellick
2016-10-18 17:20
Each shim must be able to interpret the output from kafka for a partition as a blockchain, otherwise, it cannot fulfill the ab.proto definition. Additionally, there must be some consensus on the contents of a block, assuming the batch size is greater than one. So, there is a blockchain, which embeds the ACL, the contents of which are already consensed on, so each shim can evaluate requests against the ACL, I don't see the problem.

kostas
2016-10-18 17:22
This is essentially option 4.

jyellick
2016-10-18 17:22
How are blocks cut?

kostas
2016-10-18 17:23
By going through all the messages in a partition and cutting a block every, say, 100 messages you get back on that partition. Same underlying logic.

jyellick
2016-10-18 17:24
But what if I want to create a channel/partition and send 3 transactions on it?

yacovm
2016-10-18 17:24
kostas, regarding 4- what is the retention for the topic? it needs to be preserved for ever because if Alice joins next month she should be able to read that ACL, right?

kostas
2016-10-18 17:26
@yacovm: For now, yes.

kostas
2016-10-18 17:26
@jyellick: What about it?

jyellick
2016-10-18 17:28
Well, no block would ever be created? Because there were not 100 messages, only 3? What I am driving at is that I think that, in order to cut blocks, there must be some sort of leader election among the shims. Whether this is done through ZK, a special topic, or whatnot. And at this point, it seems like the ACL problem sort of 'goes away'

kostas
2016-10-18 17:29
Right, there is a timeout after which a "time-to-cut" message is sent, and when you read it back you know you can cut.

kostas
2016-10-18 17:29
It's the same logic as in Option 4. You are describing option 4.

kostas
2016-10-18 17:32
Option 4 is built on top of Kafka. Option 1 is to have comms between the shims, leaving Kafka and ZK out of it.

jyellick
2016-10-18 17:33
Ah, okay, great, then yes, I don't really want (1). To me, (2),(3),(4) implied building some additional consensus for ACL support, and that struck me as unnecessary.

yacovm
2016-10-18 17:44
correct me if I'm wrong, I don't know much about kafka, but if you have information messages `m1, m2, m3` sent on that option 4 kafka topic in that order and you decide to delete `m2` (from the disk), you can't, right? you need to "wait" until it's "cleaned", but you can't have kafka clean only `m2`, so in the long term won't there be a space problem if you need `m1` but you want to delete `m2, ... m10000`? you'll need to "move" `m1` to the head of the topic or something? isn't it a management overhead? That's why I think (3) is better, because you can also "update" a node and not only "append".

kostas
2016-10-18 17:51
@yacovm: You send a message that says "members are now X, Y, Z", once the shims read this, they enforce this ACL. Even if the old message that said "members are now X, Y" is still available on the broker it doesn't matter because it has a smaller offset.

yacovm
2016-10-18 17:53
members of what are now X,Y,Z? I thought you have many channels, so you need for *each* channel to say: members of channel *C* are: X, Y, Z, and send that on that shared topic, or am I missing something?

yacovm
2016-10-18 17:56
so I'm asking- if I understand correctly and you have 1 kafka topic in which you send *everything*, then if there is a channel *C'* for which there is no update in its membership, it can't be deleted, otherwise it'll disappear. am I wrong?

kostas
2016-10-18 17:58
Members of a channel 'C' are now X, Y, Z.

kostas
2016-10-18 17:58
The first statement is correct.

kostas
2016-10-18 18:01
The second statement is also correct. We've discussed the option of closing a channel if there hasn't been any activity on it for a while, so I guess at this point you would also post on the same topic, but nothing concrete yet. At any rate, if the Kafka partition is to be pruned, you would have to re-persist the channel memberships periodically.

yacovm
2016-10-18 18:02
and who does the re-persistence? don't you need consensus on that?

yacovm
2016-10-18 18:02
how does a shim know he's the re-persister?

kostas
2016-10-18 18:03
The pruning of a Kafka topic is actually a config setting in Kafka, it happens w/ no interaction from the shims.

kostas
2016-10-18 18:04
Ah, you deleted the Q :simple_smile:

yacovm
2016-10-18 18:04
yeah I sometimes type something and think of another thing

kostas
2016-10-18 18:04
No problem, I do that a lot as well.

kostas
2016-10-18 18:11
As for the rest of the questions, this goes back to the "how do we cut a block" discussion. Introduce a random amount of delay to each shim, and have them all post to the topic. (So: if X, Y see Z's message on the delivery stream first, and it matches what they were going to send, they cancel their own transmission.) This is not the cleanest of solutions, and a case where Option 3 would help (by setting a watch, etc.).

yacovm
2016-10-18 18:12
yeah, that's all I wanted to bring up (well, I *am* a fan of ZK so maybe I'm biased)

kostas
2016-10-18 18:14
Yes, I hear you. Thanks for the feedback. I haven't discounted Option 3 entirely yet, and depending on what my tests show I may bring it up again.

cca
2016-10-18 19:18
@kostas : good discussion! Your 3 would look silly given that there is already such a coordination tool, which is option 4. overall, i would not make it overly complex: start with a static configuration, topics remain there forever (cannot be closed), just like their ledgers would remain, and so on.

umasuthan
2016-10-19 09:18
has joined #fabric-consensus-dev

vukolic
2016-10-19 13:04
@kostas - I need to catch up on the above discussion on Kafka ACL - but at some point it would be great if you sync with Elli/Angelo who are working on ACL for channels - so this should not be independent :slightly_smiling_face: @elli @adc

adc
2016-10-19 13:06
Indeed, great :slightly_smiling_face:

mart0nix
2016-10-19 13:12
how is the merge coming along

mart0nix
2016-10-19 13:13
I'm eager to start using some new features in the convergence branch

tuand
2016-10-19 13:13
@mart0nix the maintainers are using #fabric-reconcile

mart0nix
2016-10-19 13:17
@tuand thanks : ))

mart0nix
2016-10-19 13:18
@tuand are you part of the IBM dev team ?

tuand
2016-10-19 13:19
a small cog of the team, yes :slightly_smiling_face:

elli
2016-10-19 16:03
@kostas, it would be great if we could talk indeed.

elli
2016-10-19 16:03
Thanks @vukolic!

jyellick
2016-10-19 17:08
@elli I would also like to be involved in the channel ACL discussion. In order to support configuration/reconfiguration I've already done some ACL work around this, and would love to generalize it to handle the channel needs

kostas
2016-10-19 18:22
@cca: Maybe it's not entirely silly, as it could give us a clear leader in operations such as cutting a block, or the periodic re-persisting of channel config, which we now address with some redundancy and extra noise - but yes, I get your point and I'll keep it simple.

kostas
2016-10-19 18:22
@vukolic @adc @elli Will do, working on a very simple prototype now to test out the Kafka APIs, and whatever overlaps with your domain will be rewritten. I'll touch base with you next week when it's ready.

kostas
2016-10-19 18:22
@jyellick I want to use the work you're doing on policy implementations, so we'll tie everything together.

cca
2016-10-19 19:24
well, i understand little of Kafka, but i do not see why the shims should interact directly with the ZK inside kafka, this seems non-modular.

cca
2016-10-19 19:25
instead, it would be nice to use the kafka API for this. something like this: before any message is sent on the channel, the creator sends the ACLs, the shims filter this out and enforce it. (wasn't this the genesis block?)

cca
2016-10-19 19:26
btw, how does a "channel" map to kafka? channel <=> kafka topic ?

kostas
2016-10-19 19:47
@cca: Point taken on overloading Kafka's ZK. Your second statement is more-or-less correct. As to your question, based on the APIs that I see available so far, we're looking at a new topic with a single partition. (A new partition in an existing topic could also work of course, but the API is not there.)

adc
2016-10-20 07:44
@kostas thanks :slightly_smiling_face:

adc
2016-10-20 08:53
There is an IBM page (https://developer.ibm.com/hadoop/2016/07/20/kafka-acls/) that describes ACLs in Kafka. Really nice. For topics, ACLs can be configured on the following operations: CREATE/READ/WRITE/DESCRIBE.

adc
2016-10-20 08:53
So, the Kafka cluster administrator can decide who is allowed to create topics. Nice!

garisingh
2016-10-20 10:38
Just catching up on this conversation as I was trapped in meetings yesterday.

garisingh
2016-10-20 10:39
I assume that you do know that it is possible to write your own authorization plugin for Kafka (the default simple authorizer does store ACLs in ZK)?

garisingh
2016-10-20 10:41
and just to double-check, the plan is that each channel maps to a Kafka topic?

adc
2016-10-20 11:07
yes, I have the same understanding

hgabor
2016-10-20 12:32
guys, I am on sick leave today so I won't participate in the daily scrum. the SBFT thing was merged; my patch for it (https://gerrit.hyperledger.org/r/#/c/1737/) only adds one-time pull synchronization. @tuand @kostas @vukolic @jyellick

kostas
2016-10-20 12:37
@garisingh I am well aware of that, and had also posted the relevant links for this in the crypto channel.

kostas
2016-10-20 12:37
RE: channel mapping to a Kafka topic - yes.

kostas
2016-10-20 12:45
( @garisingh: This is Option 2 in my write-up above.)

sanchezl
2016-10-20 13:38
Is there a recommended convention for naming orderer nodes?

tuand
2016-10-20 13:44
for 0.6 yes, required naming vp0, vp1, etc... for v1.0 no, we should be able to map from certs

jyellick
2016-10-20 14:00
@tuand post the hangout link here in case others want to join?

tuand
2016-10-20 14:01
worked earlier today but not now !

2016-10-20 14:01
@tuand has started a Google+ Hangout for this channel. https://hangouts.google.com/hangouts/_/qdazmwzqg5apdlfx3suuutlutie.

adc
2016-10-20 14:02
@kostas, could you clarify for me what the shim is in the context of Kafka? Sorry :disappointed:

kostas
2016-10-20 14:15
@adc: Sure. It's essentially the middleman between the peers and the Kafka cluster. The peers issue Broadcast and Deliver RPCs and the shim turns them into the proper API calls for the Kafka cluster. See: https://gerrit.hyperledger.org/r/#/c/1627/
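
In code terms, the shim role kostas describes here is roughly the following sketch. The Kafka client is stubbed out with a plain list, and all names are illustrative, not the actual Fabric API:

```python
# Hypothetical sketch: the shim sits between peers and the Kafka cluster,
# turning Broadcast/Deliver RPCs into produce/consume operations.
class KafkaShim:
    def __init__(self, topic_log):
        self.topic_log = topic_log          # stand-in for a Kafka topic

    def broadcast(self, envelope):
        self.topic_log.append(envelope)     # ~ producer sends to the topic

    def deliver(self, offset):
        yield from self.topic_log[offset:]  # ~ consumer seeks to offset, polls

shim = KafkaShim([])
shim.broadcast("tx-a")
shim.broadcast("tx-b")
print(list(shim.deliver(0)))  # -> ['tx-a', 'tx-b']
```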

adc
2016-10-20 14:18
ah, perfect. Got it. Thanks :slightly_smiling_face:

kostas
2016-10-20 14:20
Sure thing.

tuand
2016-10-20 14:21
working on FAB-665 (https://jira.hyperledger.org/browse/FAB-665) which is a tool for admins to create the genesis block as defined for bootstrapping in FAB-359. Also seeing how I can generate this block according to @jyellick 's proposed ab.proto changeset ( https://gerrit.hyperledger.org/r/#/c/1795/1 )

jyellick
2016-10-20 14:22
@tuand You are likely better off referring to https://gerrit.hyperledger.org/r/#/c/1817/ as this generates a simple genesis block using those protos (statically, in code)

tuand
2016-10-20 14:24
thanks jason ! 1817 is on a tab somewhere on my desktop :slightly_smiling_face:

elli
2016-10-20 14:30
Hi, so @kostas, does it mean that the shim is able to update the ACLs? If so, would it be responsible for processing a specific type of transaction, e.g., a configuration one?

elli
2016-10-20 14:30
For example if a channel is owned by two parties, it may be that both of them need to "agree" to have the ACL extended by one party or have the ACL reduced by one or more parties.

kostas
2016-10-20 15:57
@elli: The shim should be able to update the ACLs, and it would be responsible for processing the CONFIG tx, yes.

jyellick
2016-10-20 19:10
@kostas I've heard it proposed that the Kafka shim will need to maintain a copy of the rawledger (to support (re)configuration, and to allow the possibility of infinite block retention). If this is the case, then it seems like each channel would have one consumer, writing to the rawledger, and then clients would read from this rawledger (not spawning a consumer per client). I know you have (understandably) expressed an interest to try to keep the kafka mappings as 1-1 and not re-invent functionality, so I wanted to check to see if you agree this is the plan or if you propose some alternative?

kostas
2016-10-20 19:27
@jyellick I'm not sure I agree with this, but maybe I'm missing something.

kostas
2016-10-20 19:28
For infinite block retention, my intention was to simply set `log.retention.hours=2147483647` in Kafka itself. This is approx. 250,000 years.
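
For what it's worth, the arithmetic behind that estimate checks out (a back-of-the-envelope calculation, not Fabric code):

```python
# Kafka's log.retention.hours takes a 32-bit signed int; setting it to the
# maximum value effectively disables time-based log deletion.
MAX_RETENTION_HOURS = 2147483647  # 2^31 - 1

HOURS_PER_YEAR = 24 * 365
years = MAX_RETENTION_HOURS / HOURS_PER_YEAR
print(f"{years:,.0f} years")  # on the order of 245,000 years
```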

kostas
2016-10-20 19:29
> If this is the case, then it seems like each channel would have one consumer, writing to the rawledger, and then clients would read from this rawledger (not spawning a consumer per client).

kostas
2016-10-20 19:31
Can you expand on this explanation? If a channel has one consumer, then this is one of the shim-carrying nodes. Which one? And doesn't this mean you have to worry about this node crashing, etc. (basically recreate the problem that the cluster was set to solve)?

jyellick
2016-10-20 19:33
Ah, yes, so, careless wording. I should have said "seems like each channel would have one consumer _per shim_, writing to the rawledger". I would correspondingly assume that there would be one producer per channel, per shim, rather than spawning a new one per client.

jyellick
2016-10-20 19:38
The piece which seems most problematic trying to keep the 1-1 mapping of broadcaster<->producer and deliverer<->consumer is that because the blockstream contains configuration, it must be parsed. The producers and consumers are not truly independent, the current block height may affect the behavior of them both. (The particular problem I am thinking of, is who is allowed to broadcast/deliver to a chain) @kostas

jyellick
2016-10-20 19:45
I suppose just as the plan is for block cutting, the shims could share a channel where they post chain/channel reconfigurations, this could be dodged.

jyellick
2016-10-20 19:46
I think I will backtrack and say the more I think about this, the more I am convinced that we should keep the 1-1 passthrough.

kostas
2016-10-20 19:59
> I suppose just as the plan is for block cutting, the shims could share a channel where they post chain/channel reconfigurations, this could be dodged.

kostas
2016-10-20 19:59

kostas
2016-10-20 20:03
I'm still not sure we've addressed the concern though?

jyellick
2016-10-20 20:04
Perhaps not entirely. I think maybe what I need to see is exactly how block cutting is going to work.

jyellick
2016-10-20 20:07
The reason why I started this conversation is I was thinking trying to unify the Kafka/Solo common components into a single codebase would be a good idea. Where the Kafka shim is simply populating a local rawledger, this would be a trivial exercise, however, after some thought, I'm inclined to agree with you, that that's the wrong way to do this.

kostas
2016-10-20 20:15
So, right, in Kafka mode, is there any reason not to have the rawledger reside in a partition? Think of a topic with 2 partitions. In partition A you push all the transactions as you receive them from the clients, and then the timeout triggers (or the batch size threshold) and you send a message to cut for block X. When you read it back, you know which messages comprise block X and you can push block X to partition B, effectively making partition B your raw ledger. Of course the problem is the usual one: how do you prevent other shims from pushing the same block to partition B and thus ruining your chain? We can add some logic to the shims for this (and hey, option 3 would be nice) but we need to do it right.
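
The two-partition flow described here can be sketched as follows; names are purely illustrative, and "CUT" stands in for the "cut for block X" message:

```python
# Hypothetical sketch of the scheme above: partition A carries raw
# transactions plus "cut" markers; processing a cut marker assembles the
# pending transactions into a block destined for partition B (the raw
# ledger). Duplicate cut markers are harmless no-ops.
def cut_blocks(partition_a_messages):
    pending, blocks = [], []
    for msg in partition_a_messages:
        if msg == "CUT":
            if pending:
                blocks.append(tuple(pending))
                pending = []
        else:
            pending.append(msg)
    return blocks

print(cut_blocks(["tx1", "tx2", "CUT", "CUT", "tx3", "CUT"]))
# -> [('tx1', 'tx2'), ('tx3',)]
```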

kostas
2016-10-20 20:17
Maybe we should have the shim populate a rawledger after all? (Great, now you agree with my first thought, and now I agree with your first thought.)

kostas
2016-10-20 20:18
The problem is that whereas multiple "let's cut for block X" messages in partition A are harmless, that's not the case for partition B.

jyellick
2016-10-20 20:18
> and then the timeout triggers (or the batch size threshold) and you
Whose timer? Who is "you"? (Assuming multiple shims)
> and you send a message to cut for block X.
Is this to a special topic? Or the same channel?

kostas
2016-10-20 20:19
You are a shim.

kostas
2016-10-20 20:19
Has to be the same partition where the messages flow.

jyellick
2016-10-20 20:21
So you have a partition for a channel, and shims broadcast messages to this partition, not blocks.

jyellick
2016-10-20 20:22
When a shim broadcasts a message to a partition, it starts a timer for when to cut the batch.

jyellick
2016-10-20 20:23
Now, does the shim track to see whether that batch is cut? Or does it send the cut message regardless?

jyellick
2016-10-20 20:24
Or does the timer start when a batch is cut?

kostas
2016-10-20 20:25
If it reads a cut message, before it sends its own I would expect it to stay silent.

jyellick
2016-10-20 20:25
So then, the shim must have a consumer for every topic it broadcasts to?

kostas
2016-10-20 20:25
Yes.

jyellick
2016-10-20 20:26
Okay, so there's a special dedicated per topic consumer for each shim, and then additional consumers are spawned per deliver call?

jyellick
2016-10-20 20:27
Ah, wait, but deliver does not consume from that topic, it must be a different one, with the blocks?

kostas
2016-10-20 20:28
The special consumers are for the shim to figure out when certain changes (ACL, blocks) take effect. The normal consumers are spawned to serve Deliver RPCs.

kostas
2016-10-20 20:28
It must be the one with the blocks, yes.

kostas
2016-10-20 20:28
This can easily be a partition on the same topic by the way.

jyellick
2016-10-20 20:29
Okay, so, then does the shim need a consumer for _every_ partition, regardless of whether it is broadcasting to it at this time? Or is there a dedicated topic which maintains the config across channels?

jyellick
2016-10-20 20:30
And whose responsibility is it to put the block onto the block channel?

kostas
2016-10-20 20:30
So that last question brings me to this observation/problem:

yacovm
2016-10-20 20:30
Hey, sorry for barging in while you're discussing kafka stuff, but I have a question regarding multiChannels and I assume this is the best channel (pun intended) to ask - is there anyone here that knows more about the plan of how multi-channels will be implemented, besides what's written in Binh's google doc (which I've read)

yacovm
2016-10-20 20:31
something doesn't add up :neutral_face:

kostas
2016-10-20 20:31
What's the specific question?

kostas
2016-10-20 20:31
(Jason, will resume.)

yacovm
2016-10-20 20:31
well- where is the replication support, how can it be done?

yacovm
2016-10-20 20:32
we decided long ago that peers that join or were offline for a while, get data from fellow peers

jyellick
2016-10-20 20:34
Ah, in short, it was decided that tracking which peers were to subscribe to which channel is a function of the app. So the only piece of code which knows which peers are active on which channels would be the app

yacovm
2016-10-20 20:34
It is also said:
```
If the channel already exists, the list of Participants is the replacement of the existing list. The Orderers automatically replace the subscribers and eventually deliver the transaction together with other transactions on this channel.
```
This sounds like a problem. If I'm a peer and I didn't get a participant update about a *removal* of a peer from a channel I may replicate information to it

jyellick
2016-10-20 20:34
It could be that we need to add some other information when a peer is bootstrapped to a chain, indicating which other peers it is allowed to discuss this chain/channel with

yacovm
2016-10-20 20:35
what app? the chaincode?

jyellick
2016-10-20 20:35
No, the application, the thing using the chaincode

jyellick
2016-10-20 20:35
Channel membership is done at a participant/org level, not a peer level

yacovm
2016-10-20 20:35
the node SDK then

kostas
2016-10-20 20:35
The app using the node SDK.

yacovm
2016-10-20 20:35
yeah

jyellick
2016-10-20 20:36
Any peer with a chain of trust to a participant CA is allowed to broadcast/deliver on that channel

jyellick
2016-10-20 20:36
But not every peer will want to transact on every channel

jyellick
2016-10-20 20:36
So, it is the application (built on the SDK) which decides which peers will have a copy of the contents of which channels

yacovm
2016-10-20 20:36
I'm asking only regarding data replication between peers

jyellick
2016-10-20 20:36
Right

yacovm
2016-10-20 20:37
I'm a peer and a peer asks for a block from me. how do I know i'm allowed to send it to him?

jyellick
2016-10-20 20:37
I think there may be a shortcoming in the design from this respect, that there should be an API for the app to inform the peer of possible sync sources and destinations

yacovm
2016-10-20 20:37
(a block from a chain of a channel)

yacovm
2016-10-20 20:37
but, the only way of distributing such information has to be via a transaction, right?

jyellick
2016-10-20 20:38
Why?

yacovm
2016-10-20 20:38
how else are you supposed to do that then?

jyellick
2016-10-20 20:38
The app must inform a peer to subscribe to a channel

jyellick
2016-10-20 20:38
This expressly cannot be done via another chain, because it would leak that information via that chain

yacovm
2016-10-20 20:39
you can't have an app contact peers and tell them about membership, it won't work.

yacovm
2016-10-20 20:39
and not a good idea- what about peers that are unavailable at that moment?

yacovm
2016-10-20 20:39
but are part of the channel?

jyellick
2016-10-20 20:39
What is the problem with that?

yacovm
2016-10-20 20:39
maybe I should bring these question up in the google doc or something

yacovm
2016-10-20 20:39
I was just checking maybe I didn't understand something

kostas
2016-10-20 20:39
Wait, what's the counter-argument for posting the reconfig in the channel again?

yacovm
2016-10-20 20:40
the counter argument is that I claim that the membership set *cannot* reduce in size, only extend

yacovm
2016-10-20 20:41
wait, I mean- that's not a counter argument, but that's what I'm saying

jyellick
2016-10-20 20:41
There is nothing which prevents the peers from maintaining a list of peers on the chain who are transacting there.

jyellick
2016-10-20 20:41
I meant that you could not maintain the membership on a different chain

yacovm
2016-10-20 20:42
you're right but I'm saying the list cannot throw nodes out of it, only grow

kostas
2016-10-20 20:42
@yacovm: Because you may be lagging behind and when you reconnect you communicate with peers that may be removed from that channel?

yacovm
2016-10-20 20:43
yes

jyellick
2016-10-20 20:44
My concern is making the configuration too large. Especially if the list cannot shrink, then with tens of thousands of peers over time, this could make the configuration transaction quite bloated, possibly problematically so

yacovm
2016-10-20 20:45
lol tens of thousands of peers? we'll retire by that time

yacovm
2016-10-20 20:46
I don't think 10s of thousands is something anyone should worry about now

jyellick
2016-10-20 20:46
If we are not architecting for that number, why are we even concerned about fetching older blocks peer to peer?

yacovm
2016-10-20 20:46
because a peer can join the network of any size, even of size 10

yacovm
2016-10-20 20:47
and the consensus cannot bring that information to a new peer

yacovm
2016-10-20 20:47
only fellow peers

yacovm
2016-10-20 20:47
at least that's what I've been told will happen in v1.0, and since our team are taking care of the synchronization part and Binh told me we shouldn't send information to a peer that isn't authorized I'm concerned here

jyellick
2016-10-20 20:48
Here is a gut proposal which may have problems, but hear it out. What if a peer could post a transaction to the chain, with its address/identity, requesting that other peers contact it to supply state transfer?

yacovm
2016-10-20 20:48
I don't understand, can you elaborate?

kostas
2016-10-20 20:48
Hmm, clever.

jyellick
2016-10-20 20:49
So, the fact that a peer is authorized to transact on a channel implies that it is authorized to receive the chain for that channel

yacovm
2016-10-20 20:49
wait, that's a problem

yacovm
2016-10-20 20:49
who decides which peer will fill the gaps for that peer that joined?

yacovm
2016-10-20 20:49
it's either everyone, or you have a bystander effect

yacovm
2016-10-20 20:50
(no one will)

kostas
2016-10-20 20:50
The peer gets offers and takes up one of the other peers on their offer?

jyellick
2016-10-20 20:50
They would certainly not all have to send the blocks. Just say "Hi, ask me for blocks if you need them"

kostas
2016-10-20 20:51
Exactly.

yacovm
2016-10-20 20:51
so all peers gang up on him when he joins? doesn't sound very scalable, but now that I think of it, maybe I have an idea

yacovm
2016-10-20 20:52
when you submit a block, that block is related to a channel right?

yacovm
2016-10-20 20:52
I mean, you don't mix transactions from different channels in the same block

jyellick
2016-10-20 20:52
Correct

yacovm
2016-10-20 20:52
"multi-ledger"

yacovm
2016-10-20 20:52
so... why not simply append the membership information with each block?

jyellick
2016-10-20 20:53
Every transaction on a chain must contain the same chain ID (and each channel has a unique chain ID)

yacovm
2016-10-20 20:53
for example, the hash of all participant's PKI-ids

yacovm
2016-10-20 20:53
by the consensus

yacovm
2016-10-20 20:53
this solves it

jyellick
2016-10-20 20:53
I'm not sure I follow

kostas
2016-10-20 20:53
Unless the reconfig message is posted to the chain, it doesn't.

yacovm
2016-10-20 20:54
it is.

yacovm
2016-10-20 20:54
I'll explain why

kostas
2016-10-20 20:54
And actually even then it doesn't I think.

kostas
2016-10-20 20:54
Go for it.

yacovm
2016-10-20 20:54
let's say you're peer p0 and you're allowed to be in the channel until time T, ok? Starting from time T+epsilon, you're not allowed.

yacovm
2016-10-20 20:55
each block sent after T, is sent without you in the membership of that block, and each block sent prior to time T, includes you in the block as an authorized peer

yacovm
2016-10-20 20:55
let's say p1 is always part of the channel and p0 contacts p1 after T

yacovm
2016-10-20 20:55
p0 should've received all blocks before T if he was alive at that time

yacovm
2016-10-20 20:55
so it's "safe" to send blocks that were created before T to p0

yacovm
2016-10-20 20:56
blocks that were created after T, won't be sent because the peer (p1) checks the list of the block he's about to send to p0 and sees p0 isn't found there
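
A minimal sketch of that check, assuming each block carries the membership set that was current when it was cut (all names hypothetical):

```python
# Each block records which peers were authorized at the time it was cut;
# a peer forwards a block only to peers that appear in that set.
def may_send(block, requester_id):
    return requester_id in block["members"]

blocks = [
    {"seq": 1, "members": {"p0", "p1"}},  # cut before T: p0 still a member
    {"seq": 2, "members": {"p1"}},        # cut after T: p0 was removed
]
print([b["seq"] for b in blocks if may_send(b, "p0")])  # -> [1]
```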

yacovm
2016-10-20 20:56
I think it doesn't heavily impact the size of the block, as long as we have, like, up to hundreds of peers

jyellick
2016-10-20 20:57
Is it the hash of the set of peers, or a set of hashes of the peers?

yacovm
2016-10-20 20:57
the set of hashes

yacovm
2016-10-20 20:57
else it doesn't give you any information

jyellick
2016-10-20 20:57
Why not simply embed it in the configuration and not in every block?

yacovm
2016-10-20 20:58
because you can't send the new configuration to p1 in time

kostas
2016-10-20 20:58
Right, I thought you were referring to a different problem.

yacovm
2016-10-20 20:59
p1 might *also* get the block that is created after T, from a fellow peer, who did get the configuration but sent it to p1 as it's allowed to

jyellick
2016-10-20 21:00
I'm not sure I follow:
> because you can't send the new configuration to p1 in time
The configuration is in a block? You must have it before you can send future blocks

kostas
2016-10-20 21:00
Exactly. In your original problem statement, didn't you refer to a node that comes back online and doesn't know whether it can reach out to another peer for a specific channel?

yacovm
2016-10-20 21:01
you're saying that if p1 got a block that is created after time T, it must have gotten that configuration block?

kostas
2016-10-20 21:01
Yes.

yacovm
2016-10-20 21:01
that's only if it gets it from the consensus, what if the state replication isn't in-order?

yacovm
2016-10-20 21:02
and p1 got it from a peer p2 ?

yacovm
2016-10-20 21:02
oh

yacovm
2016-10-20 21:02
hmmm... I see

yacovm
2016-10-20 21:02
all blocks are part of the same chain

yacovm
2016-10-20 21:02
so it can't commit the block until it "committed" the configuration block, right?

kostas
2016-10-20 21:02
Unless everything is ordered and verified in a chain you don't act.

jyellick
2016-10-20 21:03
Exactly. This vastly simplifies state transfer, as you play state forward, rather than backwards then forwards.

yacovm
2016-10-20 21:04
yep. Thanks for the clarifications!

yacovm
2016-10-20 21:04
carry on with kafka


kostas
2016-10-20 21:05
And my response to that is --


kostas
2016-10-20 21:06
And why I'm thinking that a raw ledger (as per your original suggestion) might be inevitable.

jyellick
2016-10-20 21:07
Okay, now I'm following you

jyellick
2016-10-20 21:09
So, the reason why I reversed position on the raw ledger, is that in order to maintain a raw ledger, every shim must maintain a copy of every blockchain. So whereas Kafka allows you to take 20 kafka nodes to spread out 10k partitions (at maybe 1000 partitions per broker), this would force the shims to maintain 10k chains, regardless.

kostas
2016-10-20 21:09
Good point.

kostas
2016-10-20 21:12
I think inevitably we'll get down to option 3 (maybe with a different ZK ensemble) as a clearly elected leader could save us from all of these problems. Let's try to approach it like option 4 for now, and see if some simple logic on the shim side ("the shim who sent the first 'cut block' message is the one that puts the block"; that can also fail in a ton of ways, e.g. that shim can crash, but that's the kind of logic I'm talking about) can take us all the way there.

jyellick
2016-10-20 21:16
I was typing something similar. Option (3) seems somewhat inevitable. Or rather, each channel needs a leader to do the actual block cutting, and leveraging ZK seems like a natural fit. We can definitely use 4 to mimic some leader election, and that may be the path of least resistance for now, but, I would be wary of sinking too much effort into hacking on a 'leader election over Kafka', when there are obviously purpose built tools (like ZK) out there for just such a thing.

donovanhide
2016-10-20 21:17
Why not just raft?

jyellick
2016-10-20 21:18
@donovanhide RAFT would certainly be a solution, or etcd, or any of these other canned consensus options. But, since we are working with Kafka and must have a ZK deployment already, it seems like a good option.

kostas
2016-10-20 21:19
@donovanhide Option 3 would entail etcd (so Raft), or a ZK ensemble.

donovanhide
2016-10-20 21:19
Thanks for the answer :slightly_smiling_face:

jyellick
2016-10-20 21:20
Sure thing

kostas
2016-10-20 21:21
@kostas pinned a message to this channel.

jyellick
2016-10-20 21:22
@kostas So I am thinking maybe the common components across Kafka/Solo (and ultimately SBFT), is going to be the incoming broadcast filtering, block cutting based on some stream of messages (not only the incoming broadcast messages), and then block stream consumption (for reconfiguration)

kostas
2016-10-20 21:25
This sounds right to me.

qq
2016-10-21 01:41
has joined #fabric-consensus-dev

cca
2016-10-21 07:32
regarding the above discussion about when to "cut" a block: if this is the only reason to introduce some non-modular addon to the kafka system, then we should not do blocks. after all, the block in blockchain is an artefact of bitcoin's protocol. kafka uses a stream or sequence of messages, and logically that is what we need. fundamentally there is no need for blocks other than to keep a superficial similarity with the PoW consensus.

kostas
2016-10-21 13:11
Well, and on top of that, Kafka thrives performance-wise when it's dealing with small messages (in the single kB range -- not that it's bad otherwise, it'll probably still outperform whatever custom mechanism is out there).

kostas
2016-10-21 13:13
But then the model between the two cases (CFT and BFT) becomes considerably different. In the BFT case we deal with blocks/batches on purpose, for performance reasons.

sanchezl
2016-10-21 13:37
If I start a peer directly in the vagrant dev env vm, where is its state saved?

kostas
2016-10-21 13:39
@sanchezl `/var/hyperledger/production`. This is also shown in the `core.yaml` file.

sanchezl
2016-10-21 13:40
thanks

jzhang
2016-10-21 14:04
@jyellick @kostas stupid question: what’s the future work on pbft like? isn’t that orthogonal to v1.0 architecture, other than maybe refactoring it to comply with the new orderer interface, which is just two methods: broadcast() and deliver()

jzhang
2016-10-21 14:05
i know part of it is adopting gossip

jyellick
2016-10-21 14:05
@jzhang That's correct. @simon has actually produced a sort of next gen pbft called sbft (simple bft) which we intend to use for v1

jzhang
2016-10-21 14:05
right?

jzhang
2016-10-21 14:05
ah ok

jyellick
2016-10-21 14:06
Hardening it is a lower priority item than getting the end to end flow solid via Kafka, so although it may be an experimental option, I would not expect for sbft to be 'production ready' anytime soon

jyellick
2016-10-21 14:07
Gossip I think is orthogonal to pbft/sbft, those will still rely on point to point links. Gossip is intended to help the peer network scale

jzhang
2016-10-21 14:09
Gossip is intended to help the peer network scale, meaning “committers” right? the consenters won’t be using gossip?

jyellick
2016-10-21 14:11
Referring to committers, yes. In the terminology that seems to be used most frequently, 'peers' are anything which has a copy of the validated / evaluated ledger. 'orderers' are the nodes which provide the ordering service. The ordering network runs some form of consensus. The peers may also choose to run some consensus on the validated ledger via the gossip network. So, we try not to use the word 'consenter' as this is an ambiguous term.

jzhang
2016-10-21 14:14
ok, fair enough, so the consensus network/cloud won’t be using gossip?

kostas
2016-10-21 14:14
Correct.

jzhang
2016-10-21 14:14
and the reason is? (just curious)

kostas
2016-10-21 14:15
I think the question should be reversed: why would it have to use gossip? For one, the cardinality of the orderer set is much smaller than the peer set.

jzhang
2016-10-21 14:17
interesting, i thought the opposite was true: the size of the consensus network (in terms of # of nodes) should be much larger compared to # of peers

jzhang
2016-10-21 14:18
i understand that peer nodes number is open ended

jyellick
2016-10-21 14:18
There is nothing which prevents an ordering network from using gossip, but for pbft/sbft the integrity of the point to point link between nodes is critical to the design/security of the protocol. For other ordering networks like Kafka, they seem to have chosen point to point as well, which makes sense, because gossip is great for large numbers of nodes, but bad for latency.

jyellick
2016-10-21 14:19
I would expect the ordering network to have comparatively few nodes to the peer network

kostas
2016-10-21 14:19
I can see no reason why you would want the number of nodes participating in the orderer network to be larger than the peer network.

jyellick
2016-10-21 14:20
The ordering network runs nearly zero business logic, it simply orders transactions. It must, to allow for ACL management and a few other things process a very minimal 'configuration transaction' periodically, but this should be low overhead and rare.

jzhang
2016-10-21 14:20
@jyellick thanks for that on latency, makes sense

jyellick
2016-10-21 14:21
It is the peer network which actually maintains the complicated world state, generates and evaluates proposals, etc. The peer network should need larger numbers in order to effectively scale.

kostas
2016-10-21 14:21
A better way to put it: there is zero relationship between the size of the orderer set and the peer set.

jzhang
2016-10-21 14:22
@kostas @jyellick if i were to set up a network, i want the consensus cloud to be as large as i can afford so as to maximize the cost of collusion

kostas
2016-10-21 14:22
No, you want the nodes of the consensus network to be picked in such a way that the possibilities for collusion are minimized.

jzhang
2016-10-21 14:23
ok

garisingh
2016-10-21 14:28
@jyellick @kostas - for the comms between the orderers and the peers, would it make sense to use the same gossip protocol that may be used between peers? Or do we still go with a point to point link for peers to orderers (and then in the future for peers which can't contact ordering nodes directly rely on the peer gossip protocol)?

kostas
2016-10-21 14:28
The latter is what I have in mind.

garisingh
2016-10-21 14:29
I assumed as much and I think everyone is on the same page there but just wanted to check

jyellick
2016-10-21 14:31
That was also my assumption. I would think for gossip to be effective, both sides need to be able to initiate connections, but I would expect in most cases, a network link initiated from the ordering service would not be able to contact a peer directly

yacovm
2016-10-21 14:33
isn't it basically a question of scalability + network conditions? If you have a lot of peers (1000) and only a small amount of orderers (i.e BFT) and maybe lots of network hiccups (bad WAN lines), I think gossip can be used even for consenters <---> peers.

garisingh
2016-10-21 14:40
@yacovm - I would assume that you will ALWAYS need some number of peers to be directly connected to the ordering service. We could then use gossip for those peers to "re-broadcast" to nodes which cannot directly reach the ordering service. As I think I once stated, I thought that one of the assumptions for atomic broadcast was that you actually rebroadcast to other peers you know about the first time you see a message. I know we are not there yet, but that's how I would see gossip playing in terms of relaying

yacovm
2016-10-21 14:41
I assume you're saying this only because the consenters don't keep the state, and sending via gossip isn't "atomic broadcast"

ozzyatwork
2016-10-21 16:20
has joined #fabric-consensus-dev

jyellick
2016-10-21 17:02
@yacovm I always like to go back to 'ordering as a service'. In this case, I would assume that the peer admin has a peer network on some LAN/VPN behind firewalls and other access control to keep random people from being able to attack it. Then, the peer admin punches an outgoing hole in the peer network firewall to the ordering service, so that the peers can connect to it. It doesn't seem realistic to me to expect the ordering service to be able to connect to the peers in this scenario, in which case the point to point style connections to the orderer with gossip among the peers makes much more sense to me.

yacovm
2016-10-21 17:04
well, empirically speaking, if you embed our gossip component in an orderer, it's only a matter of time until the probability-based gossip phase will make the peer connect to the orderer, and then the orderer can gossip to it because the gRPC stream is bidi.

yacovm
2016-10-21 17:05
our communication layer considers a remote peer "responsive" if you are connected to it or it is connected to you

jyellick
2016-10-21 17:20
I guess I am just not sure what the advantage to gossip is in this scenario? How is this better than having a peer probabilistically determine whether it connects point to point to the ordering service, then disseminates via gossip?

yacovm
2016-10-21 17:22
oh, hmmm... I actually really like this idea you just proposed

yacovm
2016-10-21 17:22
wait, isn't that dangerous?

yacovm
2016-10-21 17:23
I assume your "direct connection to the orderer" code is the one that ensures atomic broadcast, right?

jyellick
2016-10-21 17:24
I'm not following

yacovm
2016-10-21 17:24
what if we're really unlucky, and only 1 peer connected directly (bad coin tosses)? that means everything goes through him

yacovm
2016-10-21 17:24
lets say you have a PBFT orderer

jyellick
2016-10-21 17:24
Okay

yacovm
2016-10-21 17:24
and 10 peers and 9 of them said "I'll connect via gossip" and only 1 of them connected directly

yacovm
2016-10-21 17:25
when the PBFT sends a transaction, it ensures that 1 peer got it, right? atomic broadcast

jyellick
2016-10-21 17:26
No, I think you may be misunderstanding atomic broadcast

jyellick
2016-10-21 17:28
Ignore PBFT, because it doesn't really matter. For an atomic broadcast service, clients can connect in and call `Broadcast` to cause a message to enter the service for ordering. Other (or the same) clients may call `Deliver` with a specified offset, and receive a stream of ordered messages (in our case, in blocks) starting from that offset, and continuing as they are produced. The contract from the ordering service, is that everyone gets the same `Deliver` messages, in the same order, regardless of which ordering node the client connected to.
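The Broadcast/Deliver contract described here can be sketched as a toy in-memory orderer. This is purely illustrative (the real service streams signed blocks over gRPC; `Orderer`, `Broadcast`, and `Deliver` here are simplified stand-ins, not the fabric API): every client sees the same messages in the same order, and a client that reconnects at an offset gets the same suffix of the log.

```go
package main

import "fmt"

// Orderer is a toy in-memory stand-in for an atomic broadcast service:
// Broadcast appends a message to a single totally-ordered log, and
// Deliver replays that log from a given offset. Every client sees the
// same messages in the same order, regardless of where it connects.
type Orderer struct {
	log []string
}

func (o *Orderer) Broadcast(msg string) {
	o.log = append(o.log, msg)
}

func (o *Orderer) Deliver(offset int) []string {
	if offset > len(o.log) {
		offset = len(o.log)
	}
	// Return a copy so callers cannot mutate the shared log.
	out := make([]string, len(o.log)-offset)
	copy(out, o.log[offset:])
	return out
}

func main() {
	o := &Orderer{}
	o.Broadcast("tx1")
	o.Broadcast("tx2")
	o.Broadcast("tx3")

	// A client connected from the start sees the full log...
	fmt.Println(o.Deliver(0)) // [tx1 tx2 tx3]
	// ...and a peer that fell behind and reconnects at offset 1
	// receives the same suffix in the same order.
	fmt.Println(o.Deliver(1)) // [tx2 tx3]
}
```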

jyellick
2016-10-21 17:29
So, in this case, if you have 10 peers, and 9 of them decide to gossip, and only 1 of them decides to connect to the ordering service, if that one has a problem, or decides not to forward things, after some period of time, the 9 would recompute their 'do I connect' logic, and connect to the ordering service with the offset of the first block they don't have, and pull down the stream of blocks, and everything would be fine.

jyellick
2016-10-21 17:30
The only real 'danger' would be if the ordering service prunes after some period of time, but given a sufficiently long pruning window, and sufficiently frequent recomputation of 'do I connect' at the peer side, this should be fine.

yacovm
2016-10-21 17:31
"after some period of time" - I thought consenters don't keep the ledger inside, how can you be sure that the consensus service kept the parts that were not delivered?

yacovm
2016-10-21 17:31
and, how can the other peers even "know" they were supposed to get blocks?

jyellick
2016-10-21 17:31
Out of the gate, the ordering service will likely retain the ledger indefinitely.

yacovm
2016-10-21 17:31
oh...I didn't know that.

jyellick
2016-10-21 17:31
Eventually, pruning will need to happen, but I would expect for the pruning interval to be weeks or months

jyellick
2016-10-21 17:32
Sufficiently long, that the peer network should notice the problem and correct it.

yacovm
2016-10-21 17:32
ok, then all is good. I thought only Kafka has that capability (because it comes with it) and that PBFT won't have the ledger

jyellick
2016-10-21 17:32
No, PBFT will definitely have a ledger, and allow seeking

jyellick
2016-10-21 17:33
We considered having some rule for "once it's been delivered, it's okay to prune"

jyellick
2016-10-21 17:33
But that gets very tricky, because that depends on how many faults you want to tolerate at the peer side

jyellick
2016-10-21 17:33
And then the orderers must consense about who has delivered what

jyellick
2016-10-21 17:33
At the end of the day 'retain it for a long time' is much simpler, and I think much more practical

yacovm
2016-10-21 17:34
ok so what about my other question: ``` and, how can the other peers even "know" they were supposed to get blocks? ```

jyellick
2016-10-21 17:37
That's a valid question, and one that you could try to solve probabilistically, but I think the simple answer would be, if I were a peer admin, I would want to ensure that at least one of my peers is connected to the ordering network at all times.

jyellick
2016-10-21 17:38
In fact, as you design the gossip network, I wonder if specifying the ability to select 'mandatory peers' would be a good idea

jyellick
2016-10-21 17:39
If I administer 10 peers, on a network of 1000 peers. I want to make sure that each peer is connected to 10 other peers, including 2 of my 10. For instance.
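The 'mandatory peers' idea sketched here could look something like this (all names are hypothetical, not fabric's gossip API): pick random neighbors up to a fanout, but seed the set with the configured mandatory ones first.

```go
package main

import (
	"fmt"
	"math/rand"
)

// pickNeighbors chooses up to fanout peers to connect to, always
// including the mandatory ones first and filling the remainder
// randomly from the other candidates. Illustrative sketch only.
func pickNeighbors(candidates, mandatory []string, fanout int) []string {
	chosen := append([]string{}, mandatory...)
	seen := map[string]bool{}
	for _, m := range mandatory {
		seen[m] = true
	}
	// Collect the non-mandatory candidates, shuffle, take what we need.
	var rest []string
	for _, c := range candidates {
		if !seen[c] {
			rest = append(rest, c)
		}
	}
	rand.Shuffle(len(rest), func(i, j int) { rest[i], rest[j] = rest[j], rest[i] })
	for _, c := range rest {
		if len(chosen) >= fanout {
			break
		}
		chosen = append(chosen, c)
	}
	return chosen
}

func main() {
	candidates := []string{"p1", "p2", "p3", "p4", "p5", "p6"}
	// Always keep p2 and p5 (my own peers); fill up to 4 connections.
	fmt.Println(pickNeighbors(candidates, []string{"p2", "p5"}, 4))
}
```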

yacovm
2016-10-21 17:39
I also think that "privileged peers" that are always connected to the ordering service and "peripheral peers" that are connected via gossip is the way to go

yacovm
2016-10-21 17:40
i.e it can be passed in the configuration

yacovm
2016-10-21 17:40
oh, you're saying that you're trying to minimize the hop count, right? I think what's needed is actually the opposite - that peers that are connected directly to the ordering service don't choose to disseminate to peers that announce they are also connected to the ordering service; that way they only forward to peripheral peers. It's actually really easy to incorporate into the existing code :slightly_smiling_face:
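The filter yacovm describes, where directly-connected peers skip other directly-connected peers when forwarding, is essentially a one-pass filter over the neighbor list (a hypothetical sketch, not the gossip component's actual code):

```go
package main

import "fmt"

// forwardTargets keeps only "peripheral" peers: those that did not
// announce a direct connection to the ordering service, so blocks are
// not redundantly pushed to peers that already receive them directly.
func forwardTargets(neighbors []string, direct map[string]bool) []string {
	var out []string
	for _, n := range neighbors {
		if !direct[n] {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	// p1 and p4 announced they already talk to the orderer.
	direct := map[string]bool{"p1": true, "p4": true}
	fmt.Println(forwardTargets([]string{"p1", "p2", "p3", "p4"}, direct)) // [p2 p3]
}
```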

jyellick
2016-10-21 17:43
Oh, more, that I want to make sure that my peers have up to date information. And the only way to be sure of this is to ask the orderer. So, I want to make sure I can ask someone I trust (ie one of my peers) if I missed anything.

yacovm
2016-10-21 17:44
Ok, I got it.

jyellick
2016-10-21 17:45
It makes less sense for a large network, but in the 10 node network you describe, I would want to make sure one of my peers is privileged

yacovm
2016-10-21 17:45
what makes less sense? I'd say in a 10 node network, make them all connect to the orderer...

yacovm
2016-10-21 17:46
by the way, I did some thinking about multi-channel support and wrote something in #fabric-gossip-dev , you're more than welcome to take a look and comment if you have the time.

jyellick
2016-10-21 17:46
The larger the network, the more I can 'trust it', because there are so many more interests at stake, and therefore it's much harder to censor information to my peers. In a 10 node network, as you point out, if only 1 peer ends up connecting to the ordering service, and decides to not tell, then we have a problem. [Edit: That is why requiring one of my peers talk to the orderer directly makes less sense in a large network]

jyellick
2016-10-21 17:47
Ah, thanks, so many channels, hard to keep up

jyellick
2016-10-21 17:48
And agreed, in a 10 peer network, everyone should simply connect to ordering

tuand
2016-10-21 17:49
and who asked yacov to create #fabric-gossip-dev ? hmmm ? :innocent:

jzhang
2016-10-21 18:56
@simon @garisingh @jyellick @yacovm this is somewhat related to discussions above but from slightly different angle. the application would need to make contact with the consensus cloud on the broadcast side (just like peers on the deliver side), for trustworthiness purposes it’d also want to connect to more than one consensus node, preferably f+1. is that accurate?

jzhang
2016-10-21 18:56
if so, does it make sense for each org/enterprise to set up a proxy node to the consensus cloud which will communicate to f+1 consensus nodes and leaves the applications shielded from these concerns?

jzhang
2016-10-21 18:56

garisingh
2016-10-21 19:01
I think you need to be aware of multiple ordering nodes from an availability perspective, but don't necessarily need to be connected to a specific number of them. You could of course maintain a list and round-robin between them as well. I think you would want to keep track of whether or not your transactions made it through, though - for example, if you used one ordering node and detected that some number of transactions were never processed, you would want to switch to another node
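The round-robin idea here is a few lines of client-side state (a sketch; `roundRobin` and the node names are illustrative, not fabric code):

```go
package main

import "fmt"

// roundRobin cycles through the list of known ordering nodes; on a
// detected failure the client simply calls pick again to advance to
// the next one.
type roundRobin struct {
	nodes []string
	next  int
}

func (r *roundRobin) pick() string {
	n := r.nodes[r.next%len(r.nodes)]
	r.next++
	return n
}

func main() {
	r := &roundRobin{nodes: []string{"orderer0", "orderer1", "orderer2"}}
	for i := 0; i < 4; i++ {
		fmt.Println(r.pick())
	}
	// orderer0, orderer1, orderer2, then wraps back to orderer0
}
```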

garisingh
2016-10-21 19:02
Well that's my opinion. I like the idea of round robin approach tho

jzhang
2016-10-21 19:02
yes, a lot of logic like this sounds like middleware instead of having each application do that

jzhang
2016-10-21 19:03
such logic (round robin, rotating list) can be built into a proxy node that serves all apps against a consensus cloud for that org/enterprise

jzhang
2016-10-21 19:03
(I had the same round-robin idea in my discussion in FAB-476)

garisingh
2016-10-21 19:05
I guess it would depend on how many apps an enterprise had.

jyellick
2016-10-21 19:10
With respect to the byzantine attacks, as you point out, our primary concern is censorship, either in not forwarding transactions for ordering, or in not delivering blocks as they are created. Both of these should be pretty detectable from the client side. If a peer has not received a block in some amount of time from the consensus service, or if it learns that it's behind relative to other peers, it should try another ordering node. If a client submits transactions to be broadcast and after some period of time, some percentage of them were never processed, it should switch nodes as well.

troyronda
2016-10-23 00:31
@jyellick which ledger would be retained indefinitely? Context being where subledgers are being employed for confidentiality (and whose contents shouldn’t be retained by an orderer service).

jyellick
2016-10-23 01:05
@troyronda because the subledger goes through ordering, it is like any other chain, and the ordering service must retain it, at the very least until enough peers have received it. Ultimately, we will implement pruning at the orderer, but this is a pending item (and why I say out of the gate, above). Do keep in mind that the ordering service is only retaining the raw blocks, it does not interrogate the block contents to build any sort of state (beyond configuration transactions)

hgabor
2016-10-24 08:38
@jyellick @kostas @vukolic I just updated this: https://gerrit.hyperledger.org/r/#/c/1737/

tuand
2016-10-24 14:01
still having problems re-using a hangout link .... creating a new one here for today's scrum

jyellick
2016-10-24 14:01
@tuand I think we should be posting the link here regardless

yacovm
2016-10-24 14:01
lol every single time

2016-10-24 14:01
@kostas has started a Google+ Hangout for this channel. https://hangouts.google.com/hangouts/_/2loizp2ln5h4tgrescmnf6r63me.

2016-10-24 14:01
@tuand has started a Google+ Hangout for this channel. https://hangouts.google.com/hangouts/_/egafrpefmng35dfwqppay2epaqe.

jyellick
2016-10-24 14:01
No point in re-using the link, and we want to make sure anyone who wants to join us is free to

yacovm
2016-10-24 14:02
there are 2 different scrums now

yacovm
2016-10-24 14:02
which one do I join to

kostas
2016-10-24 14:02
Go with Tuan's.

yacovm
2016-10-24 14:02
do you guys see Tuan's hangout link 2nd and kostas's first?

yacovm
2016-10-24 14:03
maybe we can use slack as an ordering service?

tuand
2016-10-24 14:03
ha !

kostas
2016-10-24 14:03
@sanchezl can you join?

sanchezl
2016-10-24 14:05
BRT



sanchezl
2016-10-24 17:06
help is needed to verify this change set on a Windows developer workstation: https://gerrit.hyperledger.org/r/#/c/1935

tuand
2016-10-24 17:08
luis, post on #fabric or #fabric-dev ? I've seen a couple folks ask about couchDB on those channels

ghaskins
2016-10-24 20:21
@sanchezl: FYI, if you add a comment "reverify" to 1935, it will kick the CI system to run again.

ghaskins
2016-10-24 20:21
This particular patch is a candidate for "ci-skip" also

ghaskins
2016-10-24 20:22
That facility works via the commit message

donovanhide
2016-10-25 11:45
Hi, I’m trying to get a five peer PBFT test rig running. When I do a `peer chaincode deploy` command on `peer-1` I get this in my logs:
```
11:39:19.874 [consensus/pbft] ProcessEvent -> INFO 02a Replica 0 batch timer expired
11:39:19.874 [consensus/pbft] sendBatch -> INFO 02b Creating batch with 1 requests
11:39:28.853 [consensus/pbft] ProcessEvent -> INFO 02c Replica 0 view change timer expired, sending view change: new request batch pt7Y3/r6NmEbFMcknejbmQjcMRbI+3XOPhTN0KqNEf4Xy+/+yMpKsGW7Xp7VEijXGZDj9gC/WGUlDwNmCERwSA==
11:39:28.854 [consensus/pbft] sendViewChange -> INFO 02d Replica 0 sending view-change, v:1, h:0, |C|:1, |P|:0, |Q|:1
11:39:28.860 [consensus/pbft] recvViewChange -> INFO 02e Replica 0 received view-change from replica 0, v:1, h:0, |C|:1, |P|:0, |Q|:1
11:39:30.860 [consensus/pbft] sendViewChange -> INFO 02f Replica 0 sending view-change, v:1, h:0, |C|:1, |P|:0, |Q|:1
11:39:30.861 [consensus/pbft] recvViewChange -> INFO 030 Replica 0 received view-change from replica 0, v:1, h:0, |C|:1, |P|:0, |Q|:1
11:39:30.861 [consensus/pbft] recvViewChange -> WARN 031 Replica 0 already has a view change message for view 1 from replica 0
11:39:32.861 [consensus/pbft] sendViewChange -> INFO 032 Replica 0 sending view-change, v:1, h:0, |C|:1, |P|:0, |Q|:1
11:39:32.862 [consensus/pbft] recvViewChange -> INFO 033 Replica 0 received view-change from replica 0, v:1, h:0, |C|:1, |P|:0, |Q|:1
11:39:32.862 [consensus/pbft] recvViewChange -> WARN 034 Replica 0 already has a view change message for view 1 from replica 0
11:39:34.863 [consensus/pbft] sendViewChange -> INFO 035 Replica 0 sending view-change, v:1, h:0, |C|:1, |P|:0, |Q|:1
```
But no other peer seems to be receiving any view-changes and their logs are empty. Does it look like this peer is sending view-changes to itself?

muralisr
2016-10-25 13:17
@kostas https://jenkins.hyperledger.org/job/fabric-verify-x86_64/2014/console not sure if this a one time thing but thought you might want to know...

muralisr
2016-10-25 13:18

kostas
2016-10-25 13:21
@donovanhide: The output makes sense, i.e. when counting view-change messages we also consider the one we sent. I am curious though as to why the peer is sending a view-change to vote off itself. How are you naming your peers?

kostas
2016-10-25 13:21
@muralisr: Thanks for the heads up, will investigate right away. sarama's mock package is not the greatest.

jyellick
2016-10-25 13:22
@donovanhide This looks like https://jira.hyperledger.org/browse/FAB-707 which is odd looking behavior, but normal [edit: actually, I'm not so convinced, but it's probably still worth reading]

donovanhide
2016-10-25 13:22
```
kubectl get pods
NAME     READY   STATUS    RESTARTS   AGE
peer-0   2/2     Running   0          1h
peer-1   2/2     Running   0          1h
peer-2   2/2     Running   0          1h
peer-3   2/2     Running   0          1h
peer-4   2/2     Running   0          1h
```
Peers have a hostname of `peer-0` but a DNS name of peer-0.peer.default.svc.cluster.local

garisingh
2016-10-25 13:23
what's the peer.id for each?

kostas
2016-10-25 13:23
@donovanhide: I figured. You need to do a vpX name for all.

donovanhide
2016-10-25 13:24
I’m doing:
```
command: ["sh","-c","sleep 10;CORE_PEER_ID=$(hostname) CORE_PEER_ADDRESS=$(hostname).peer.default.svc.cluster.local:7051 peer node start"]
```

yacovm
2016-10-25 13:24
kostas, you're serious? vp-i is enforced?

kostas
2016-10-25 13:24
i.e. vp0, vp1, etc.

garisingh
2016-10-25 13:24
this has been known for a while - this is for v0.6.x

donovanhide
2016-10-25 13:24
Kubernetes Petsets enforces the hyphen, unfortunately...

kostas
2016-10-25 13:24
@yacovm: Yes. A known weakness.

donovanhide
2016-10-25 13:24
Wish someone had told me this a week ago :slightly_smiling_face:

garisingh
2016-10-25 13:25
but it's just the peer.id, not necessarily its crypto id

garisingh
2016-10-25 13:25
@donovanhide - I think I might have a while back - sorry if it was not clear :wink:

kostas
2016-10-25 13:25
@donovanhide: I can point you to the point in the code where you can maybe play around with the naming scheme and get it to work.

donovanhide
2016-10-25 13:25
@kostas that would be great :slightly_smiling_face:

yacovm
2016-10-25 13:25
can't you just set CORE_PEER_ID to something else?

yacovm
2016-10-25 13:25
you can derive the name from the hostname

donovanhide
2016-10-25 13:26
It has to be derived from the hostname, but could maybe do some `sed`-ing.

yacovm
2016-10-25 13:26
yep

yacovm
2016-10-25 13:26
it's better IMO than doing a hacky code change in your stuff that might later be overwritten

garisingh
2016-10-25 13:26
we just end all our hostnames with -vpX

garisingh
2016-10-25 13:26
in BMX

donovanhide
2016-10-25 13:27
So I could have `peer-1-vp1`?

kostas
2016-10-25 13:28
(A sec, while I reboot the laptop.)


kostas
2016-10-25 13:32
You'll notice that it just strips away the first two characters from the handle, expecting a `vpX` naming scheme. (lines 94-95)
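The behavior kostas describes amounts to something like this (a sketch of the v0.6 constraint, not the actual fabric code; `replicaID` is a hypothetical name):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// replicaID mimics the v0.6 PBFT handle parsing described above: it
// assumes a `vpX` naming scheme and simply strips the "vp" prefix, so
// any peer.id not of that form (e.g. "peer-1") fails to map to a
// replica number, which is why donovanhide's view changes went nowhere.
func replicaID(handle string) (int, error) {
	if !strings.HasPrefix(handle, "vp") {
		return 0, fmt.Errorf("handle %q does not match vpX scheme", handle)
	}
	return strconv.Atoi(handle[2:])
}

func main() {
	fmt.Println(replicaID("vp3"))    // 3 <nil>
	fmt.Println(replicaID("peer-3")) // error: does not match vpX scheme
}
```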

kostas
2016-10-25 13:33
And `getValidatorHandle()` right below will have to be edited accordingly. Let me know if you need any help editing those.

donovanhide
2016-10-25 13:35
@kostas Ok, thanks. I can probably munge my peer.id to match that expectation. Made the point in the other channel that the id scheme is abusable via collisions, maliciously or accidentally. An id derived from some entropy (a private key would be best) plus a signed HELLO message which contains both the id and a node type seems a bit more secure and less prone to mistakes like the one I’ve been making :slightly_smiling_face:

kostas
2016-10-25 13:36
I agree. This current scheme is a terrible hack, I'm glad it's going away.

donovanhide
2016-10-25 13:37
I think we have reached consensus :slightly_smiling_face:

jyellick
2016-10-25 14:06
@tuand I just pointed @keithsmith to you to talk about the composition of the configuration in the genesis block (and at reconfiguration) so that you guys can coordinate on names and encodings etc. I think it would be a great idea if you wanted to start a document which describes the assorted configuration objects which will be embedded (both their names and encodings)

tuand
2016-10-25 14:07
thanks Jason !

tuand
2016-10-25 14:08
I started the description in FAB-665 as well

david.acton
2016-10-25 15:13
has joined #fabric-consensus-dev

donovanhide
2016-10-25 16:20
@kostas Got it working with: `CORE_PEER_ID=$(hostname|sed 's/peer-/vp/')` Thanks! Next bug :slightly_smiling_face: I’m using remote github urls to test deploying chaincode: `peer chaincode deploy -p https://github.com/donovanhide/chaincode -c '{"Function":"init", "Args": []}'` But am getting this in my logs:
```
16:13:37.713 [consensus/pbft] sendBatch -> INFO 028 Creating batch with 1 requests
16:13:37.959 [consensus/pbft] executeOne -> INFO 029 Replica 0 executing/committing request batch for view=0/seqNo=1 and digest 8h4lIhV7Q1ZHBO1gMN/Z+dmkscCaDanPNxU9d0J5kC/XVwgoO9x9BU70Mc2NidVO+akGuPbTKoCeA5Q/DkZiyw==
16:13:41.961 [dockercontroller] deployImage -> ERRO 02a Error building images: Tag latest not found in repository http://docker.io/hyperledger/fabric-baseimage
16:13:41.962 [dockercontroller] deployImage -> ERRO 02b Image Output:
********************
Step 1 : FROM hyperledger/fabric-baseimage
Pulling repository http://docker.io/hyperledger/fabric-baseimage
********************
```
Does this mean I need to prepare my Docker VMs for each peer with that image?

kostas
2016-10-25 16:21
@donovanhide: Cool, glad you got that one working. That second error is outside my area of expertise unfortunately. Paging @muralisr.

jyellick
2016-10-25 16:29
@donovanhide My understanding is that this image is deliberately not tagged as 'latest' remotely so that older builds do not accidentally grab it. Instead, the image is tagged as such locally during dev env construction. @ghaskins I think has limited availability, but I believe he might be the best person to ask. If you look at `./devenv/setup.sh` you'll see references to `BASEIMAGE_RELEASE` which I think gets used in the `Makefile` to pick the image source. (Sorry, this is also not my area of expertise) you might also find some help in #fabric-dev-env

donovanhide
2016-10-25 16:31
@kostas @jyellick Thanks! I’ll wait for any other replies here rather than spamming all the channels :slightly_smiling_face:

echenrunner
2016-10-25 16:32
has joined #fabric-consensus-dev

muralisr
2016-10-25 16:36
@donovanhide how did you build the peer and how are you running it ?

donovanhide
2016-10-25 16:37

donovanhide
2016-10-25 16:38
Each peer is pulling `hyperledger/fabric-peer:latest` on kubernetes and just running the peer binary

yacovm
2016-10-25 16:39
where is it pulling it from?

donovanhide
2016-10-25 16:39
So when I deploy some chaincode, it is trying to build the docker image on each peer and run it in its own local docker container. It’s pulling it from http://hub.docker.com.

donovanhide
2016-10-25 16:40
Each petset generates a peer container and a docker container within a "pod".

donovanhide
2016-10-25 16:40
Trying to simulate how a topology might work in real life.

echenrunner
2016-10-25 16:41
I'm assuming you're running a PBFT network of at least 4 peers ... are you sending transactions to the network after you restarted your peer ? that peer won't know it is lagging unless it is receiving checkpoint messages from the other peers ... we can continue the discussion to #fabric-consensus-dev

donovanhide
2016-10-25 16:41
I’m running a PBFT network of 5 peers.

echenrunner
2016-10-25 16:42
How do I send a checkpoint message? Are you testing DR in the event a validating peer goes down?

donovanhide
2016-10-25 16:43
@echenrunner are you addressing me? :slightly_smiling_face:

echenrunner
2016-10-25 16:43
yes..

donovanhide
2016-10-25 16:45
@echenrunner I have a working PBFT network. I just can’t deploy chaincode, because of the error listed above. The http://hub.docker.com image doesn’t have the correct tag, not sure how to work with that.

kostas
2016-10-25 16:45
(@echenrunner: A checkpoint message is sent automatically every K blocks, where K can be edited in the PBFT `config.yaml`.)

donovanhide
2016-10-25 16:47
The above petset configuration might be unfamiliar. To explain, it starts up 5 pods, each of which has a docker VM container and a fabric peer container. When it tries to build the chaincode docker image on each peer container, I get that error.

echenrunner
2016-10-25 16:48
if I change viewchangeperiod from 0 to 1, what impact does it have?

donovanhide
2016-10-25 16:50
@echenrunner Is that a question for me?

muralisr
2016-10-25 16:51
@donovanhide I ran into issues with building peer due to tagging but rather give you my (homegrown) solution, let us check with @hgabor

echenrunner
2016-10-25 16:51
anybody... thanks

muralisr
2016-10-25 16:52
(or @ramesh … see you typing)

kostas
2016-10-25 16:55
@echenrunner: As the instructions in the `config.yaml` state, that means that every K blocks all validating peers will send a view-change request so that they proceed with a new primary/leader.

kostas
2016-10-25 16:56
You are then effectively rolling with a new primary per K blocks. (Assuming no Byzantine faults in between.)

ramesh
2016-10-25 17:03
@donovanhide you should have fabric-baseimage:latest

donovanhide
2016-10-25 17:04
@ramesh where? In each of my docker containers? By pulling and then tagging the http://hub.docker.com image?

donovanhide
2016-10-25 17:05
Should explain, I’m using Docker In Docker as a test.

ramesh
2016-10-25 17:05
yes

donovanhide
2016-10-25 17:06
Ok, I can do that. Just wondering why you don’t just tag the http://hub.docker.com image? Given the default in the deploy logic is to use `:latest`

ramesh
2016-10-25 17:07
docker pull hyperledger/fabric-baseimage:x86_64-0.2.0 (or another version tagged in https://hub.docker.com/r/hyperledger/fabric-baseimage/tags/) and tag it as hyperledger/fabric-baseimage:latest

donovanhide
2016-10-25 17:14
Just to share some information, @hgabor pointed to this config line which is probably causing the issue: https://github.com/hyperledger/fabric/blob/v0.6/peer/core.yaml#L289

ramesh
2016-10-25 17:16
could be the reason.. we are not pushing baseimage latest tag to hyperledger docker hub account..

ramesh
2016-10-25 17:17
so you pull the baseimage with the tag we have and re-tag with latest

donovanhide
2016-10-25 17:18
It seems like this is a Go peculiarity though as Java and CAR seem to use explicit tags: https://github.com/hyperledger/fabric/blob/v0.6/peer/core.yaml#L299 https://github.com/hyperledger/fabric/blob/v0.6/peer/core.yaml#L307

hgabor
2016-10-25 17:19
maybe we should use that arch-... tag for go too

hgabor
2016-10-25 17:19
as it seems to be the same as the tags pushed to hub

kostas
2016-10-25 17:41
@muralisr (I figured out what the issue is by the way, working on a fix now.)

muralisr
2016-10-25 17:41
@kostas thanks much

grbulat
2016-10-25 17:44
has joined #fabric-consensus-dev

senthil
2016-10-25 18:48
has joined #fabric-consensus-dev

donovanhide
2016-10-25 20:18
Anyone had issues with the building of chaincode being incredibly slow?
```
19:43:58.136 [consensus/pbft] ProcessEvent -> INFO 027 Replica 0 batch timer expired
19:43:58.136 [consensus/pbft] sendBatch -> INFO 028 Creating batch with 1 requests
19:43:58.373 [consensus/pbft] executeOne -> INFO 029 Replica 0 executing/committing request batch for view=0/seqNo=1 and digest HjKuvhOoO7kuwpNAUDLpRuLIi/zLmBJr2xvZENIbzDlzFVYS4hRTZVLGH8P0U+qaXe+CWx6n068bhEwtvzVfew==
20:11:30.383 [consensus/pbft] execDoneSync -> INFO 02a Replica 0 finished execution 1, trying next
```
It’s taken 28ish minutes to do the first part of the docker image build! I’m running 5 peers on 3x3.75GB Google Cloud boxes. Must be doing something wrong :slightly_smiling_face:

jyellick
2016-10-25 20:21
@donovanhide I have not, though as you see, the transaction makes it through consensus to the `executing/committing` phase, so you might want to try on #fabric-dev to reach a broader audience

donovanhide
2016-10-25 20:21
Okay, thanks, will re-post!

simon
2016-10-26 11:08
hi

garisingh
2016-10-26 11:14
hey @simon - only a few days left for you?

simon
2016-10-26 11:14
no, more than a month left

garisingh
2016-10-26 11:14
ah - end of Nov?

simon
2016-10-26 11:14
yep

garisingh
2016-10-26 11:14
ah - good

simon
2016-10-26 11:14
trying to get a picture what has happened

garisingh
2016-10-26 11:16
I'll let @jyellick and/or @kostas fill you in w.r.t consensus, but the biggest thing that has happened recently is that `feature/convergence` has been sunset and all the latest and greatest code is in `master` now

simon
2016-10-26 11:17
so we merged it?

garisingh
2016-10-26 11:17
and @muralisr did a great job starting to get rid of a bunch of old code in master as well

simon
2016-10-26 11:17
great

garisingh
2016-10-26 11:17
so progress has been made in the right direction

yacovm
2016-10-26 13:01
welcome back simon

simon
2016-10-26 13:01
hi yacovm

simon
2016-10-26 13:01
how's gossip going

yacovm
2016-10-26 13:02
I think I'll push the rest of the code by end of this week

simon
2016-10-26 13:03
cool

yacovm
2016-10-26 13:03
There is also a state-transfer layer on top of gossip, we haven't connected it yet but it shouldn't be a problem. A person in my squad is working on integrating it with the fabric as we speak, but he can't run it until I solve a certain bug ( I know how, coding it)

tuand
2016-10-26 13:05
hi simon, hope you had a good couple weeks off

simon
2016-10-26 13:06
i got a bit sick on amtrak - stuck air conditioner

simon
2016-10-26 13:06
and boy these transatlantic flights are tiring

tuand
2016-10-26 13:06
you were in US ?

simon
2016-10-26 13:06
yea

simon
2016-10-26 13:06
KS and IL

garisingh
2016-10-26 13:07
@simon - welcome to my life :wink: flying is for the birds

tuand
2016-10-26 13:07
we should have had a squad meetup where you went :slightly_smiling_face:

simon
2016-10-26 13:08
trucking everybody to chicago?

simon
2016-10-26 13:08
:slightly_smiling_face:

garisingh
2016-10-26 13:09
I was in Chicago 2 weeks ago - no call? just kidding

tuand
2016-10-26 13:12
so simon, you should catch up with @hgabor if you haven't done so

tuand
2016-10-26 13:13
we decided to prioritize kafka as the first orderer so @kostas is working on that

simon
2016-10-26 13:13
i am talking with gabor, yes

simon
2016-10-26 13:14
only kostas?

tuand
2016-10-26 13:14
lots of discussion regarding bootstrap and policies and multi-channels , @jyellick is handling the policy mgr among other things

tuand
2016-10-26 13:15
the multi-channel discussion is summarized here https://wiki.hyperledger.org/community/fabric-design-docs

tuand
2016-10-26 13:15
and in jira epic whose number escapes me at present

tuand
2016-10-26 13:16
bootstrapping is in fab-359

tuand
2016-10-26 13:16
kostas and luis

tuand
2016-10-26 13:16
i'm helping out on bootstrapping, so is jeff

simon
2016-10-26 13:16
configuration change policies you mean?

tuand
2016-10-26 13:17
right

simon
2016-10-26 13:17
so the kafka orderer is working?

tuand
2016-10-26 13:18
kostas is working out issues for the shim between kafka and peer ... i'll let him say how it's currently working

simon
2016-10-26 13:22
i'm just asking because if kafka is the main focus we should probably get kafka into working state instead of thinking about reconfig which is several months out

tuand
2016-10-26 13:25
np, we're looking at config policies for bootstrap and multi-channel, just happens that reconfig can reuse the same protobufs/code

tuand
2016-10-26 13:27
there's also another discussion with @elli and @adc on access control but I haven't been keeping up

simon
2016-10-26 13:29
so what happened with keith's bootstrapping stuff?

elli
2016-10-26 13:33
@tuand we will be posting updates on the access control soonish

elli
2016-10-26 13:33
hopefully later today :slightly_smiling_face:

tuand
2016-10-26 13:34
the full story is in the comments of fab-359 ... short version is we manually create a genesis block containing orderer certs, peer CA certs, default policies & orderer config. The orderer reads that genesis block in and starts the chain, and allows a peer to connect if it uses a CA cert listed in the genesis block

simon
2016-10-26 13:35
okay

simon
2016-10-26 13:36
and modification of the config is done via the policy

tuand
2016-10-26 13:36
ya

jyellick
2016-10-26 13:57
@simon Reconfig is not as far out as you think, since the orderer needs to do ACL enforcement, and ACLs may change.

simon
2016-10-26 13:58
is the ACL still what you proposed?

jyellick
2016-10-26 14:00
The SignaturePolicy stuff got merged, though the layer which utilizes it is still sitting out there in Gerrit

jyellick
2016-10-26 14:00
For the config layer, I made the policy type a 'oneof' so that we can easily swap in some policy which isn't that 'trivial signature DSL' if someone comes up with something better

simon
2016-10-26 14:05
yea

simon
2016-10-26 14:05
tho i think this will go a long way

jyellick
2016-10-26 14:17
That is my hope. At the very least, it should allow us to move forward until someone finds a deficiency it cannot address

garisingh
2016-10-26 14:38
that was my take which is why I figured we should merge it now and give it a go

jyellick
2016-10-26 15:39
@simon I'm unsure why you dislike the orderer pruning approach of 'prune at config txs'. This seems extremely elegant to me. If a config tx contains all the chain config, then the whole chain going forward can be verified by policy, and the whole network can bootstrap based on that config. What am I missing?

simon
2016-10-26 15:40
i guess that could work, but i think that pruning is antithetical to a block chain

simon
2016-10-26 15:41
somebody needs to keep around a non-pruned chain

simon
2016-10-26 15:41
not for the orderer, but for the application using the chain

jyellick
2016-10-26 15:42
Definitely, the application needs to retain these forever, there is no way around that.

jyellick
2016-10-26 15:42
I was purely thinking 'orderer pruning'

jyellick
2016-10-26 15:43
Especially for the as a service case.

jyellick
2016-10-26 15:43
My other thought too was, bootstrapping a new peer, give them the latest config block, then it can bootstrap based on that to know about who is a member of the peer network, pull new blocks from ordering, and pull old ones from peers

simon
2016-10-26 15:45
ok

simon
2016-10-26 16:07
so i'm working on the hello on connect

yacovm
2016-10-26 16:07
what hello on connect?

yacovm
2016-10-26 16:07
If I may ask?

simon
2016-10-26 16:07
when replicas reconnect, they exchange information to allow the one that is behind to get up to speed

yacovm
2016-10-26 16:08
you're talking about the peer or something else here?

simon
2016-10-26 16:08
no, i don't care about the peer

simon
2016-10-26 16:08
consensus

yacovm
2016-10-26 16:08
oh ok

yacovm
2016-10-26 16:09
was alarmed, because: 1) I submitted a PR that is related to hello messages on connections, and 2) the gossip is going to be used to sync peers. Wanted to make sure we don't step on each other's feet

simon
2016-10-26 16:09
yea, no peer messages

simon
2016-10-26 16:10
now the thing is that we may be in a different view by now

simon
2016-10-26 16:10
and the question is, who sends the new view message?

simon
2016-10-26 16:10
only the primary?

simon
2016-10-26 16:10
what if we can't connect to the primary?

simon
2016-10-26 16:10
all replicas?

simon
2016-10-26 16:10
then we receive a lot of data

jyellick
2016-10-26 16:12
I almost feel like handshake should be two phase

simon
2016-10-26 16:13
yea but then it requires state

simon
2016-10-26 16:13
and becomes synchronous

jyellick
2016-10-26 16:14
Hmmm, so, my thought was that because we have signed blocks, we can play state forward now. On hello, everyone advertises their block height, and those who are behind can ask for those blocks. If they request the blocks from someone who advertised a block height and they do not reply, or reply with bad blocks, then you know that advertisement was from a byzantine replica, and you move on
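A rough sketch of that hello/catch-up logic, with all names hypothetical (this is not the fabric code; fetching and block validation are stubbed behind a callback):

```go
package main

import "fmt"

// fetchFn stands in for requesting one block from one replica; the bool
// reports whether a valid block came back.
type fetchFn func(replica string, seq uint64) (block string, ok bool)

// catchUp tries to fill blocks (myHeight, advertised height] per replica.
// A replica that fails to deliver a valid block is recorded as suspect
// (its advertisement may have been byzantine) and skipped.
func catchUp(myHeight uint64, advertised map[string]uint64, fetch fetchFn) (uint64, []string) {
	var suspect []string
	for replica, h := range advertised {
		if h <= myHeight {
			continue // nothing new to learn from this replica
		}
		complete := true
		for seq := myHeight + 1; seq <= h; seq++ {
			if _, good := fetch(replica, seq); !good {
				suspect = append(suspect, replica)
				complete = false
				break // move on to the next advertiser
			}
		}
		if complete {
			myHeight = h
		}
	}
	return myHeight, suspect
}

func main() {
	heights := map[string]uint64{"vp1": 12, "vp2": 10}
	newHeight, suspect := catchUp(8, heights, func(r string, s uint64) (string, bool) {
		return fmt.Sprintf("block-%d", s), r != "vp2" // vp2 never replies
	})
	fmt.Println(newHeight, suspect)
}
```

Once the non-faulty replicas converge on the same height this way, the view-sync question discussed below is all that remains.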

simon
2016-10-26 16:15
sure, that's state transfer

jyellick
2016-10-26 16:15
Once the non-faulty replicas are all at the same block height, things seem fairly easy?

simon
2016-10-26 16:15
i assume that's handled

simon
2016-10-26 16:16
i need to sync the view

simon
2016-10-26 17:02

simon
2016-10-26 17:03

jyellick
2016-10-26 17:30
Done

tuand
2016-10-26 18:55

kostas
2016-10-26 18:57

tuand
2016-10-26 18:59
that's the one ... thanks kostas

kostas
2016-10-26 18:59
Sure thing.

tuand
2016-10-26 19:00
reading changeset 1817 which has a reference to configtx

jyellick
2016-10-26 19:04
I realized that when I re-submitted the changesets to master, I hadn't added anyone back as reviewers, so you may have just gotten quite a bit of spam as I fixed that up, apologies.

garisingh
2016-10-26 19:30
do I need to go over your code with a fine tooth comb? :wink:

bsm117532
2016-10-26 20:12
has joined #fabric-consensus-dev

jyellick
2016-10-26 20:25
Apparently adding that comment on `Evaluate` to an earlier changeset caused merge conflicts all down the line, so had to rebase, sorry about that

simon
2016-10-27 08:12
jyellick: any particular reason why you didn't +2 the changesets you reviewed?

simon
2016-10-27 08:12
also no need to -1 if verified fails

tom.appleyard
2016-10-27 08:17
dumb question, but does the mockstub work for fabric 0.5?

simon
2016-10-27 08:19
where what?

nitin
2016-10-27 09:32
has joined #fabric-consensus-dev

tuand
2016-10-27 13:58
scrum hangout ...

2016-10-27 13:59
@tuand has started a Google+ Hangout for this channel. https://hangouts.google.com/hangouts/_/wh7v7zj4avfpvj4nrtahlf5ajie.

tuand
2016-10-27 14:00
@simon ?

tuand
2016-10-27 14:00
@hgabor ?

tuand
2016-10-27 14:00
@yacov ?

tuand
2016-10-27 14:00
@yacovm ?

jyellick
2016-10-27 14:01
@simon Since the build was failing, I did not want to +2 (agree, maybe the -1 was unnecessary), and I commented as to why on https://gerrit.hyperledger.org/r/#/c/2025/ : the commit message does not contain a JIRA reference, and could be a bit more informative. (Just following what I have seen from other reviewers here; I have had my changesets not +2-ed for this reason.)

yacovm
2016-10-27 14:01
yes?

yacovm
2016-10-27 14:01
sorry

yacovm
2016-10-27 14:01
all the squad are in my room atm, they distracted me :slightly_smiling_face:

c0rwin
2016-10-27 14:03
@yacovm oh, here we start with excuses.

kostas
2016-10-27 14:09
@jeffgarratt Can you remind me what is your suggestion w/r/t multiple channels and their respective Broadcast/Deliver streams?

kostas
2016-10-27 14:09
I'm stubbing out support for this now

jeffgarratt
2016-10-27 14:37
@kostas I would think it may be simpler to use a different port for each channel at first

jeffgarratt
2016-10-27 14:38
could always consolidate if necessary

kostas
2016-10-27 14:46
@jeffgarratt: A different port for the connection between the gRPC client (the peer) and the gRPC server (the shim)?

jeffgarratt
2016-10-27 14:49
yes, not sure how else you could serve the same service over the same port with GRPC

kostas
2016-10-27 14:51
Well, we could do it all on the same port, and add logic on both the client and the server that filters on the channel ID. (But I remember the concern about resource starvation when multiplexing.)

kostas
2016-10-27 14:52
Let's Hangout real quick if you have time?

jyellick
2016-10-27 14:58
@kostas @jeffgarratt I don't understand the resource starvation when multiplexing a broadcast. It is a client, if it wants to try to resource starve itself, who cares? With respect to deliver, we simply update the `SeekInfo` to specify a chainID, and then invoke multiple delivers at the client side. What am I missing?
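A toy sketch of that client-side demultiplexing, assuming each delivered block carries its chain ID (names are illustrative, not the fabric `SeekInfo`/`Deliver` API):

```go
package main

import "fmt"

// block is a stand-in for a delivered block tagged with its chain.
type block struct {
	chainID string
	seq     uint64
}

// demux routes blocks arriving on one connection to per-chain handlers,
// roughly what invoking Deliver once per chainID would give the client.
type demux struct {
	handlers map[string]func(block)
}

func newDemux() *demux { return &demux{handlers: map[string]func(block){}} }

// subscribe registers a handler for one chain.
func (d *demux) subscribe(chainID string, h func(block)) { d.handlers[chainID] = h }

// dispatch routes a received block; blocks for unsubscribed chains are dropped.
func (d *demux) dispatch(b block) bool {
	h, ok := d.handlers[b.chainID]
	if ok {
		h(b)
	}
	return ok
}

func main() {
	d := newDemux()
	var got []uint64
	d.subscribe("chainA", func(b block) { got = append(got, b.seq) })
	d.dispatch(block{"chainA", 1})
	d.dispatch(block{"chainB", 1}) // no subscriber: dropped
	fmt.Println(got)
}
```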

kostas
2016-10-27 14:59
This is what I'm trying to figure out?

jyellick
2016-10-27 15:00
Sorry, I meant the question to be targeted at Jeff, just calling your attention to it

jeffgarratt
2016-10-27 15:09
I was concerned about the orderer side wrt QoS

jeffgarratt
2016-10-27 15:10
does that make sense? @jyellick

jyellick
2016-10-27 15:10
I'm afraid it does not

jeffgarratt
2016-10-27 15:11
hangout with Kostas?

jyellick
2016-10-27 15:11
Sure

2016-10-27 15:11
@jeffgarratt has started a Google+ Hangout for this channel. https://hangouts.google.com/hangouts/_/23s7hw2frjasllzi4jex6cpu4ee.

2016-10-27 15:11
@kostas has started a Google+ Hangout for this channel. https://hangouts.google.com/hangouts/_/6uvwjvozorejhhwqkzhttcmkrme.

2016-10-27 15:11
@jyellick has started a Google+ Hangout for this channel. https://hangouts.google.com/hangouts/_/4ugq4lgfuzc3bmtwwbz52otonqe.

jyellick
2016-10-27 15:11
Wow...

tuand
2016-10-27 15:11
you guys ...

jeffgarratt
2016-10-27 15:11
:wink:

jeffgarratt
2016-10-27 15:11
join hasons?

jeffgarratt
2016-10-27 15:11
jasons

muralisr
2016-10-27 15:11
:slightly_smiling_face:

muralisr
2016-10-27 15:12
I’ll stay away and focus on using the new protos integration :slightly_smiling_face:

muralisr
2016-10-27 15:12
last thing I want to do is join 3 hangouts :wink:

jeffgarratt
2016-10-27 15:13
on Jasons :slightly_smiling_face:

muralisr
2016-10-27 15:23
haha … indeed :slightly_smiling_face:

kostas
2016-10-27 16:22
In the Kafka case, do we want the addition of a new orderer to refer to (a) another shim, (b) a Kafka broker, or (c) either/or, or (d) always both?

kostas
2016-10-27 16:23
The more I think about it, the more I think that (c) is the way to go.

jyellick
2016-10-27 16:51
Instinctively I would say (a), though (c) provides a convenient mechanism to reconfigure shims. I would say not (b) and not (d).

garisingh
2016-10-27 17:01
c)

garisingh
2016-10-27 17:02
minimally a)

garisingh
2016-10-27 17:02
Sorry - MVP would need to do c) but I think we could start with a)

garisingh
2016-10-27 17:03
for adding Kafka brokers, we should be able to leverage Kafka capabilities. For a) there is work on our side

kostas
2016-10-27 17:05
Alright, so same page here. And that is indeed the plan with Kafka brokers (but I need to add support for Metadata requests to the code).

kostas
2016-10-27 19:51
Some thoughts before I turn this into a JIRA issue.

kostas
2016-10-27 19:55
We need TLS connections between the shims and the Kafka brokers.

kostas
2016-10-27 19:56
But there's no API to set the Kafka ACLs (note: these are different from the ACLs that _we_ have been talking about so far; those are maintained at the shim level).

kostas
2016-10-27 19:56
The way you update Kafka ACLs is by executing a script.

kostas
2016-10-27 19:57
So every time we add/remove a shim, we'll need to execute this script.

kostas
2016-10-27 19:57
Sounds a bit flaky, but wanted to check if there are any thoughts or alternative approaches to it.

garisingh
2016-10-27 20:02
@kostas - just to clarify - on the Kafka broker side - we want to:
1) Require TLS
2) Require client authentication between the shim(s) and Kafka
Do we want to explicitly limit access to Kafka topics or are you good with allowing any authenticated client (e.g. shims) to do anything?

kostas
2016-10-27 20:03
At the risk of missing something, I'd say I'm good with allowing any authenticated shim to do anything.

jyellick
2016-10-27 20:04
I agree, I see no reason for restriction

garisingh
2016-10-27 20:07
OK - cool. So if we:
1) Only enable the TLS (SSL in Kafka terms) listener on Kafka
2) Require client certificates
3) Then we should be able to simply configure Kafka brokers with a list of CAs to trust

kostas
2016-10-27 20:08
But if these CAs change, then the `kafka-acls.sh` script needs to be executed. That's what I'm getting at.

garisingh
2016-10-27 20:14
well I did not think that that script actually deals with the truststore - although given the truststore is going to be a Java Key Store, you'll still need a script to add / remove trusted certificates from there

garisingh
2016-10-27 20:14
I think we can avoid Kafka ACLs with TLS client authentication only, but you'll need to be able to modify the keystore

garisingh
2016-10-27 20:15
and I am not sure if it is statically loaded at runtime

garisingh
2016-10-27 20:15
Does the Go Kafka client support any of the SASL mechanisms?

binhn
2016-10-27 20:16
@kostas is it possible to config brokers to accept only local connections so that we would force shim to be on the same box/vm?

binhn
2016-10-27 20:17
and would that be good enough to remove the requirement of ssl?

garisingh
2016-10-27 20:20
you need to have TLS in any case for broker-to-broker communication

garisingh
2016-10-27 20:20
same listener(s)

kostas
2016-10-27 21:59
@binhn: Gari's correct, you'll need TLS for comms between brokers.

kostas
2016-10-27 22:01
@garisingh: I looked it up and you are correct that the `kafka-acls.sh` script does _not_ modify the truststore.

kostas
2016-10-27 22:01
But yeah you'll need a script to manage it nonetheless, so we're back to square one.

kostas
2016-10-27 22:02
The library that we are using does support SASL, but I am not familiar with the underlying mechanism. Should I be looking into it? (And is there a one-liner as to what makes it better?)

garisingh
2016-10-27 22:27
@kostas - nothing inherently makes it "better" - they chose SASL because it's a pluggable authentication layer. They support the Kerberos and Plain (username/password) SASL mechanisms today. I think there is some flexibility in adding usernames/passwords (the BMX Message Hub uses the Plain mechanism and I think they are able to easily add credentials programmatically)

garisingh
2016-10-27 22:28
I am not sure if Kafka would reload the keystore if it is modified either - likely it does - and there are utilities (keytool) which can be used from the command line / exec functions to add certs as well
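A sketch of what that broker setup might look like. These are standard Kafka SSL settings, but the listener port, file paths, and passwords here are placeholders, not anything agreed in this discussion:

```
# server.properties (sketch): TLS-only listener with required client auth
listeners=SSL://0.0.0.0:9093
security.inter.broker.protocol=SSL
ssl.client.auth=required
ssl.keystore.location=/var/private/ssl/kafka.keystore.jks
ssl.keystore.password=changeit
ssl.truststore.location=/var/private/ssl/kafka.truststore.jks
ssl.truststore.password=changeit
```

Adding or removing a trusted shim CA would then mean running `keytool -importcert` (or `keytool -delete`) against the truststore, which is exactly the script-driven step being discussed.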

garisingh
2016-10-27 22:29
so TLS with client certificates and this utility might be straightforward. There may even be keytool source out there (likely C), so perhaps you can just link it into the shim as well

simon
2016-10-28 11:24
somehow i need to stay "inactive" when restarting/reconnecting, or i might believe that i'm the primary of a view that has passed

simon
2016-10-28 11:25
i can sort of work around this by waiting for a (stored) new view message from a replica

simon
2016-10-28 11:25
but how do i deal with the initial start?

simon
2016-10-28 11:25
special code the situation that we don't have a stored new-view message and we are replica 0, therefore we are in view 0?

jyellick
2016-10-28 13:08
@simon What about bootstrapping with a constructed 'new-view' message in the log?
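A sketch of that bootstrap idea, assuming the usual PBFT rule primary = view mod n; all names are hypothetical. The point is that a fresh start and a restart then share one code path:

```go
package main

import "fmt"

// newView is a stand-in for the new-view message kept in the log.
type newView struct {
	view    uint64
	primary uint64
}

// primaryForView follows the usual PBFT rule: primary = view mod n.
func primaryForView(view, n uint64) uint64 { return view % n }

// loadOrBootstrapNewView returns the stored new-view, or a synthetic one
// for view 0 when the log is empty (i.e. a genuinely fresh start), so no
// special-casing of replica 0 on initial start is needed.
func loadOrBootstrapNewView(stored *newView, n uint64) newView {
	if stored != nil {
		return *stored
	}
	return newView{view: 0, primary: primaryForView(0, n)}
}

func main() {
	nv := loadOrBootstrapNewView(nil, 4)
	fmt.Printf("starting in view %d, primary %d\n", nv.view, nv.primary)
}
```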

hgabor
2016-10-28 13:09
@jyellick it would be good if you could have a look at these: https://gerrit.hyperledger.org/r/#/c/2037/ , https://gerrit.hyperledger.org/r/#/c/2065/

jyellick
2016-10-28 13:10
@hgabor I'll take a look

hgabor
2016-10-28 13:11
thanks

jyellick
2016-10-28 13:34
Replied with a couple comments, mostly looks good though

vukolic
2016-10-28 15:35
@simon I know you have been waiting for this one :wink: https://jira.hyperledger.org/browse/FAB-897

simon
2016-10-28 15:41
lol

simon
2016-10-28 15:41
priority high?

vukolic
2016-10-28 16:10
Not sure what's the absolute meaning of those priorities

vukolic
2016-10-28 16:10
However as it impacts for example hello

vukolic
2016-10-28 16:10
It is better to get it earlier than later

vukolic
2016-10-28 16:20
Can be medium

vukolic
2016-10-28 16:53
@simon it impacts https://jira.hyperledger.org/browse/FAB-478 you are working on now

msoumeit
2016-10-30 19:11
has joined #fabric-consensus-dev

simon
2016-10-31 07:25
@vukolic i don't think we should complicate things now before we finish the MVP

vukolic
2016-10-31 08:23
@simon MVP should have pipelining

vukolic
2016-10-31 08:23
IMO

simon
2016-10-31 08:24
i don't think that will happen, realistically

vukolic
2016-10-31 08:26
why not

vukolic
2016-10-31 08:27
if you do not wish to focus on that one - that's fine - pls focus on other JIRA items

vukolic
2016-10-31 08:27
yet eventually pipelining should be in MVP

simon
2016-10-31 08:28
MVP is march, right?

vukolic
2016-10-31 08:30
yes

vukolic
2016-10-31 08:30
approximately - I'd say

simon
2016-10-31 08:35
so realistically nothing happens around dec/jan, so that's one month out, one month testing/feature freeze, leaves nov, jan, feb

simon
2016-10-31 08:35
so maybe it'll happen

vukolic
2016-10-31 08:43
We just need to discuss https://jira.hyperledger.org/browse/FAB-478 and how would one implement that with pipelining

vukolic
2016-10-31 08:44
we need a simple solution there (I think that resending pre-prepare, prepare, commit msgs might be overkill)

vukolic
2016-10-31 08:44
that works with no pipelining but with pipelining it becomes more involved because of the buffering

vukolic
2016-10-31 08:45
so basically we may need to address https://jira.hyperledger.org/browse/FAB-478 in a diff way

simon
2016-10-31 08:50
hm

simon
2016-10-31 08:50
i think it is fair to assume that reconnect events should be rare

simon
2016-10-31 08:50
compared to the overall network activity

simon
2016-10-31 08:51
so resending the in flight messages is a simple solution

simon
2016-10-31 08:52
the alternative would be to wait for all in flight requests to finish

simon
2016-10-31 08:52
but they may not finish (they may time out instead, because of reconnecting nodes waiting)

vukolic
2016-10-31 09:01
this state transfer part is the most involved part of pipelining

vukolic
2016-10-31 09:01
the rest is making guard(s) such as (in preprepare.go, line 48)

vukolic
2016-10-31 09:01
```
nextSeq := s.nextSeq()
if *pp.Seq != nextSeq {
	log.Infof("preprepare does not match expected %v, got %v", nextSeq, *pp.Seq)
	return
}
```

vukolic
2016-10-31 09:01
aware of different counters

vukolic
2016-10-31 09:01
and the replica would simply have different counters for each msg type

vukolic
2016-10-31 09:01
and an adapted .nextSeq() function
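A minimal sketch of the per-message-type counters described above (illustrative names only; the real guard lives around preprepare.go):

```go
package main

import "fmt"

type msgType int

const (
	preprepare msgType = iota
	prepare
	commit
)

// replica keeps a separate next-expected sequence number per message type,
// as the pipelining sketch requires.
type replica struct {
	next map[msgType]uint64
}

func newReplica() *replica {
	return &replica{next: map[msgType]uint64{preprepare: 1, prepare: 1, commit: 1}}
}

// accept advances the matching counter only for the expected sequence
// number; anything else is rejected (or, in the real code, backlogged).
func (r *replica) accept(t msgType, seq uint64) bool {
	if seq != r.next[t] {
		return false
	}
	r.next[t]++
	return true
}

func main() {
	r := newReplica()
	fmt.Println(r.accept(preprepare, 1)) // in order
	fmt.Println(r.accept(preprepare, 3)) // gap: rejected
	fmt.Println(r.accept(prepare, 1))    // independent counter
}
```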

simon
2016-10-31 09:06
right

simon
2016-10-31 09:07
we already track request state partitioned

simon
2016-10-31 09:07
right now i'm trying to work out some issues with hello on reconnect

simon
2016-10-31 09:08
and which messages to discard and which ones to put in the backlog

simon
2016-10-31 09:17
the problem is that i might discard messages that are about the future, because i'm not yet synced up with it

simon
2016-10-31 09:18
so i think on hello i might have to directly sync my state

simon
2016-10-31 09:18
hm

vukolic
2016-10-31 09:18
what do you mean by "directly syncing"?

simon
2016-10-31 09:18
which is different from what i had planned

simon
2016-10-31 09:19
well, instead of processing my outstanding messages before looking at the hello message

simon
2016-10-31 11:03
i think i need to send the new-view on connect in a different way

simon
2016-10-31 11:03
because it really happens out of sequence

simon
2016-10-31 11:13
oh i think we shouldn't discard messages if we're not active

simon
2016-10-31 11:13
because the new-view message might arrive later than the prepare/commit for requests in the new view

jyellick
2016-10-31 13:44
Have an appointment during scrum time, so will report here. Finally found agreement about the message format for the orderer. Conclusion was a simple envelope with sig/payload/header. With respect to ASN.1 vs Protobuf, the conclusion was to try to wrap the structure access in utility methods and avoid direct proto marshaling/unmarshaling to make migration to another encoding easier, but to stick to protos for the time being. I'll be reworking that WIP changeset which uses a simpler envelope to use protos with finalized names, and hopefully when it's merged, will also try to help the fabric adopt it.

tuand
2016-10-31 13:59
scrum ...

simon
2016-10-31 13:59
that's silly

2016-10-31 13:59
@tuand has started a Google+ Hangout for this channel. https://hangouts.google.com/hangouts/_/4cki4gjnlfgoxfgt6j6j6q4hxae.

simon
2016-10-31 13:59
oh scrum changed because of daylight savings?

yacovm
2016-10-31 14:00
yeah

tuand
2016-10-31 14:00
daylight savings ends next weekend in us

hgabor
2016-10-31 14:47
On holiday today, won't participate in scrum. Working with @simon on other sbft tasks.

simon
2016-10-31 14:47
you're too late for scrum anyways :stuck_out_tongue:

hgabor
2016-10-31 14:49
Yeah now I see :-D

echenrunner
2016-10-31 18:51
Guys, I think I have a workaround to ensure that the user either gets the "Error: state may be inconsistent, cannot query" or the latest ledger, by invoking a dummy chaincode to force "syncBlocks" to be invoked by the peer that was down. It seems that if a member firm brought down their validating peer, they will need to invoke a "dummy chaincode" to get the latest blocks as part of their operational procedures.

jyellick
2016-10-31 19:25
@echenrunner You may want to look at enabling periodic null requests `CORE_PBFT_GENERAL_TIMEOUT_NULLREQUEST=3s` which does the same thing, but without the dummy chaincode or manual intervention

binhn
2016-10-31 19:58
on the peer, we need to call the ordering service, is there a client side and where is it?

kostas
2016-10-31 19:59
@binhn: There is. If you look into the `orderer/sample_clients` directory, you'll find examples on how to invoke the ordering service.

kostas
2016-10-31 20:02
You basically import the `atomicbroadcast` package, do a `atomicbroadcast.NewAtomicBroadcastClient(conn)` to instantiate a gRPC client and proceed as usual from there.

binhn
2016-10-31 20:07
cool, thanks

kostas
2016-10-31 20:07
Sure thing.

echenrunner
2016-10-31 20:10
@jyellick Good news and bad news. I set the null request to 5 seconds on all 4 of my VPs, brought one down (vp2), and started the invoke again on the other three peers. When I bring vp2 back up and wait, it gets the latest. However, bad news: when I started the invoke again, I noticed all 4 peers are in a loop:
```
15:59:57.541 [consensus/pbft] recvViewChange -> WARN 1f7 Replica 0 found view-change message incorrect
15:59:58.433 [consensus/pbft] ProcessEvent -> INFO 1f8 Replica 0 processing event
15:59:58.434 [consensus/pbft] sendViewChange -> INFO 1f9 Replica 0 sending view-change, v:95, h:76, |C|:2, |P|:1, |Q|:1
15:59:58.436 [consensus/pbft] recvViewChange -> INFO 1fa Replica 0 received view-change from replica 0, v:95, h:76, |C|:2, |P|:1, |Q|:1
15:59:58.437 [consensus/pbft] recvViewChange -> WARN 1fb Replica 0 found view-change message incorrect
```

jyellick
2016-10-31 20:11
@echenrunner Which level of code are you running with? There is a known bug which was fixed which could cause this behavior

echenrunner
2016-10-31 20:12
the one I downloaded a few weeks ago. How do I check the level?

echenrunner
2016-10-31 20:14
```
HyperledgerVP0:/opt/gopath/src/github.com/hyperledger/fabric # peer --version
16:14:00.541 [logging] LoggingInit -> DEBU 001 Setting default logging level to DEBUG for command 'peer'
Fabric peer server version 0.7.0-snapshot-448d207
16:14:00.544 [main] main -> INFO 002 Exiting.....
HyperledgerVP0:/opt/gopath/src/github.c
```

jyellick
2016-10-31 20:14
Do you see a line at the beginning of the log that looks like this?
```
Replica %d restored state: view: ...
```


jyellick
2016-10-31 20:17
The fact that `h` is 76 in your debug output makes me think that you have this fix, but I think it's worth verifying

jyellick
2016-10-31 20:27
Looking more closely at your logs, it looks to me like vp0 has advanced its view while the rest of the network has not. I'm not sure why it is sending a malformed view change, but I would expect that as the rest of the network advances, vp0 will state transfer, and this message will go away

jyellick
2016-10-31 20:28
Assuming that you have K=10 and a log multiplier of 4, I would expect this to happen within 5s * 10 * (4+1) = 250s, i.e. about 4 minutes.

echenrunner
2016-10-31 20:29
i see the "restored state" and it's in line 174

jyellick
2016-10-31 20:33
What does the line read?

echenrunner
2016-10-31 20:37
logger.Infof("Replica %d restored state: view: %d, seqNo: %d, pset: %d, qset: %d, reqBatches: %d, chkpts: %d h: %d", instance.id, instance.view, instance.seqNo, len(instance.pset), len(instance.qset), len(instance.reqBatchStore), len(instance.chkpts), instance.h)

jyellick
2016-10-31 20:37
Looks like you have that patch. Are you still seeing the 'message incorrect' lines?

echenrunner
2016-10-31 20:39
no... I kill all the VPs process

jyellick
2016-10-31 20:39
I'd encourage you to take a look at https://jira.hyperledger.org/browse/FAB-707

echenrunner
2016-10-31 20:39
okay... thanks

jyellick
2016-10-31 20:39
Essentially, it's possible to have one peer which erroneously votes to change views early because of a crash or network problem

jyellick
2016-10-31 20:40
This will cause the continual issuing of view change messages.

jyellick
2016-10-31 20:40
The logs look a little spammy, but from a protocol perspective, this is benign.

echenrunner
2016-10-31 20:41
that's what i was simulating, i killed the peer without "peer stop"

jyellick
2016-10-31 20:41
The replica's state will continue to be periodically synced, and once the network does view change, it will start participating again

jyellick
2016-10-31 20:41
I'm a little curious about how it is constructing an incorrect view-change message

jyellick
2016-10-31 20:41
I would not have anticipated that.

echenrunner
2016-10-31 20:46
Thanks for that links... I am reading it now

teddy
2016-11-01 09:53
has joined #fabric-consensus-dev

simon
2016-11-01 10:00

jzhang
2016-11-01 15:30
@jyellick @simon i noticed that orderer.yaml defaults the listening port to `5151`, but the docker-compose overrides it to `5005`, is there a reason for this? why not decide on one of them and go with it everywhere? from SDK point of view we’d like to decide on which one to use for the test cases. right now our tests are set up with docker-compose so it makes sense to use 5005. but having 5151 as default can break contributors who fire up SOLO orderer as native process.

jyellick
2016-11-01 15:32
@jzhang As I recall, 5005 was picked initially and arbitrarily as a stand-in, but someone, I think @cbf pointed out that this conflicted with another service we use and asked for it to be changed. So, when the configuration went in, it was changed to 5151. I'm not really sure what process we use for picking ports, but I agree, we should pick a port and fix it in both places.

yacovm
2016-11-01 15:32
Is block validation coded at the moment?


yacovm
2016-11-01 15:33
(does anyone here know, or know who writes that aspect?)

jzhang
2016-11-01 15:33
should i file a JIRA bug for this change?

cbf
2016-11-01 15:33
@jyellick I chose from the list of assigned ports a range of 10 that were unassigned 7050-7060

kostas
2016-11-01 15:34
@jzhang Yes please.

cbf
2016-11-01 15:34
I asked a while back but did not get an answer as to why we chose 50xx range because those are taken

cbf
2016-11-01 15:34
yes, we should use from the set I identified earlier

cbf
2016-11-01 15:34
I don’t know why we would use another range

cbf
2016-11-01 15:34
when all is said and done we can allocate the range we are using

jyellick
2016-11-01 15:35
Great, let's move to that range, I didn't understand 7050-7060 was what was chosen

jyellick
2016-11-01 15:35
What port specifically should the orderer use? I assume we have already allocated some of 7050-7060 for some services?

kostas
2016-11-01 15:35
Are other components using ports within that range?

kostas
2016-11-01 15:35
Oh damn it, too slow.

cbf
2016-11-01 15:35
well, a bunch are used, yes


kostas
2016-11-01 15:35
Are these listed somewhere?

cbf
2016-11-01 15:36
but with the refactor, we need to assess what is needed going forward

cbf
2016-11-01 15:36
eg REST endpoint is deprecated, no?

cbf
2016-11-01 15:36
in the PR where I changed things;-)

jzhang
2016-11-01 15:36
REST APIs are already stripped out of master

jzhang
2016-11-01 15:36
so 7050 are free now

kostas
2016-11-01 15:36
Well, it'd be great if you can link to that PR in Jim's issue.

jzhang
2016-11-01 15:37
7051 is peer grpc, 7053 is peer’s event stream, 7054 right now is member service grpc


cbf
2016-11-01 15:38
so the proliferation of CA endpoints is, I believe no longer desired

kostas
2016-11-01 15:38
I updated the issue accordingly, thx.

simon
2016-11-01 15:39
waits for jenkins to even start

jzhang
2016-11-01 15:39
what’s “7052 peer cli” used for? shouldn’t peer cli use “7051” just like SDKs?

jyellick
2016-11-01 20:22
@kostas @binhn Also posting here to prevent bad threading in #fabric-crypto The second piece, is how is the authority to create channels tracked, and how is channel creation serialized against this to prevent non-deterministic channel creation while not leaking the channel creation to those who should not witness it. (ie, doing this in the system ledger is probably not an option)

binhn
2016-11-01 22:15
@elli called it “superuser”, a leader node that processes all config transactions and disseminates to followers

kostas
2016-11-01 22:21
@binhn: I am not sure this addresses the concern raised here. You ultimately need a log where the configurations specifying who can create a channel, along with the actual requests/config_txs to create channels, are maintained. And that second part makes the system chain (which was the obvious candidate) not an option because of leaking concerns.

ermyas
2016-11-01 22:41
has joined #fabric-consensus-dev

simon
2016-11-02 08:25
still looking for reviews for my sbft changesets

elli
2016-11-02 09:20
@kostas: if there is a single node that receives all these requests, isn't it easier to enforce that these requests are handled sequentially?

elli
2016-11-02 09:21
For simplicity and for the beginning, since Kafka already has this notion of a cluster administrator, no?

hgabor
2016-11-02 12:14
could somebody review my changesets?

kostas
2016-11-02 12:28
@elli: Well, for Kafka, this is an easy problem to solve. You just assign a partition to these requests, and then it's all up to the partition leader replica to enforce order. No need to tie this specifically to the cluster controller.

kostas
2016-11-02 12:30
What was the exact flow you had in mind?

azaleta
2016-11-02 13:00
has joined #fabric-consensus-dev

seshadrs
2016-11-02 14:04
has joined #fabric-consensus-dev

jyellick
2016-11-02 14:11
@kostas I'd still like to talk to you about the race between channel creation and channel creation rights when you have a chance

elli
2016-11-02 14:12
right these are the messages that need to be serialized somehow

kostas
2016-11-02 14:12
I am here. And responded to Elli as well.

elli
2016-11-02 14:13
in the sense that orders w.r.t. changes in the permissions of entities, w.r.t. channel creation/termination, should be received in the same order by all the peers who are to process channel creation/termination requests.

kostas
2016-11-02 14:14
As long as all the shims read from the same partition, this concern w/r/t ordering is taken care of, correct?

jyellick
2016-11-02 14:15
Not quite

jyellick
2016-11-02 14:15
So, first question is: Where is the channel creation authorization policy stored?


kostas
2016-11-02 14:16
Before we proceed further.

kostas
2016-11-02 14:16
Just to make sure we're all on the same page.

jyellick
2016-11-02 14:16
Okay. So, I'll enumerate the flow with the race in it.

kostas
2016-11-02 14:16
Hold on a sec --

kostas
2016-11-02 14:16
It's my understanding based on this https://hyperledgerproject.slack.com/archives/fabric-consensus-dev/p1478078423003055 and Binh's message that Elli has a flow in mind that solves this. Is that correct or not?

elli
2016-11-02 14:18
aha, the flow in mind related to either having a central admin that processes all these requests one after the other

elli
2016-11-02 14:18
and in this case this entity will accept and process channel creation requests as well as channel-related permission config messages

elli
2016-11-02 14:18
or that you have a chain among orderers to announce these requests

elli
2016-11-02 14:18
and guarantee that all of them see it in the same order

jyellick
2016-11-02 14:19
The dedicated orderer chain is I think the likely correct answer

kostas
2016-11-02 14:19
Correct.

elli
2016-11-02 14:19
Either way in my mind.

elli
2016-11-02 14:19
If a centralized solution is easier for now

elli
2016-11-02 14:19
we can also go for this

kostas
2016-11-02 14:20
So you are leveraging the ordering guarantees of the partition, and have a partition dedicated to all the channel-related messages (configs + creation reqs).

elli
2016-11-02 14:20
right

elli
2016-11-02 14:20
but assumin that this chain is internal

elli
2016-11-02 14:20
is not exposed to others

elli
2016-11-02 14:20
to peers

kostas
2016-11-02 14:20
I am fully with you.

kostas
2016-11-02 14:22
We had brought this solution up when we settled on the multi-channel JoinChannel API.

kostas
2016-11-02 14:23
A question though is:

kostas
2016-11-02 14:26
These configuration messages dictate the policies for channel creation, as well as the orgs that can request them. Do you only see those stored in that special partition?

kostas
2016-11-02 14:27
Because as far as I can tell, this is stuff that also belongs on the system chain.

kostas
2016-11-02 14:27
So, if my understanding is correct, there's some cross-posting that needs to happen here.

garisingh
2016-11-02 14:28
isn't the idea to basically have a "dedicated" API for config (which would of course take as a parameter the channel(s) it applies to)?

kostas
2016-11-02 14:29
(And then you come up with the usual race conditions that we saw with the whole cut block / push block mechanisms in Kafka and multiple channels.)

jyellick
2016-11-02 14:29
@garisingh Config changes are implemented as a special transaction today

kostas
2016-11-02 14:29
@garisingh Can you expand on that?

jyellick
2016-11-02 14:29
Which comes in over the normal `Broadcast` API

garisingh
2016-11-02 14:36
@kostas - I guess I had a thought where all config transactions were submitted on an internal channel (regardless of which channel the config was for). I would assume this is the case for creating a channel?

kostas
2016-11-02 14:37
Yes, we are talking about an internal channel here and these channel requests. The tricky bit though is this:


kostas
2016-11-02 14:37
(And my follow-up messages.)

kostas
2016-11-02 14:38
From all the discussions I've participated in, these system-wide reconfig blocks belong on the system chain.

garisingh
2016-11-02 14:38
right - but does everything that comes in this internal channel get propagated to the channel (if any) it affects? But in any case, could the orderer then just (re)submit the transaction on the appropriate channel (if required) - or is there some trust violation there?

garisingh
2016-11-02 14:39
the system chain which goes to the peers?

kostas
2016-11-02 14:39
The orderer re-submitting is what I referred to as cross-posting.

garisingh
2016-11-02 14:39
yep

kostas
2016-11-02 14:39
i.e. that's how I see it working.

garisingh
2016-11-02 14:39
so we were actually saying the same thing then (I think)

kostas
2016-11-02 14:39
The system chain would go to the peers, I assume, yes.

kostas
2016-11-02 14:39
(Gari: yes.)

jyellick
2016-11-02 14:41
All of this makes me think that we should really redefine the whole 'system chain' concept. I would suggest instead, that we should move to bootstrapping the orderer service separately from bootstrapping the peer network. Essentially bootstrapping the ordering service would give you a thing that can create channels. Then, to bootstrap the peer network, you would just create channels, one with all of the peer members for the 'system channel', but really, that would just be a normal channel.

kostas
2016-11-02 14:42
Jeff had suggested that as well, and I like the idea.

garisingh
2016-11-02 14:43
Well I agree that we need to support that - imagine the case where the people who run the ordering service are not actually the people who have peers which connect

garisingh
2016-11-02 14:43
which is a viable scenario

jyellick
2016-11-02 14:43
That's still possible without it, by simply spinning up a new ordering service on demand, but I agree the other mechanism is more elegant

jyellick
2016-11-02 14:44
(Or by spinning up an ordering service with no peer network and reconfiguring it)

jyellick
2016-11-02 14:49
How about this flow: Assuming we have an 'orderer chain', the channel creation transaction comes in as a transaction bound for the desired new channelName/chainID. As one of the configuration parameters, it specifies the orderer chain ID. The ordering service checks the authorization policy on that chain, and if it's valid, wraps and submits this transaction to that chain for ordering. Once the transaction is finally ordered, it is now concretely valid or invalid, and at this point, the channel is created with a genesis block containing the initial transaction.

jyellick
2016-11-02 14:54
As a neat side effect of specifying the orderer chain ID, would be that you could use the same mechanism to spin up 'new orderer chains' on a single backing ordering service. Probably not ideal for dedicated hosting, but for free offerings, it would allow multi-tenancy on a single backing Kafka cluster for instance

kostas
2016-11-02 14:56
I am OK with that.

jyellick
2016-11-02 15:03
Great, sounds like we have a plan then? Are there any ambiguities that should be settled?

muralisr
2016-11-02 15:08
@jyellick the comment above `How about this flow: ….` summarizes the discussion ?

muralisr
2016-11-02 15:08
trying to see how far I should read back

jyellick
2016-11-02 15:09
I would start there and see if you have questions

kostas
2016-11-02 15:10
I'm hesitant to call it a plan just yet, but it looks like it'll work.

muralisr
2016-11-02 15:12
@jyellick @kostas would a diagram with swim lanes be too hard ? :slightly_smiling_face: … (note I hesitate to ask )

kostas
2016-11-02 15:12
@muralisr: I'd argue that you need some context, which is that the current design will need some cross-posting: https://hyperledgerproject.slack.com/archives/fabric-consensus-dev/p1478096773003096

muralisr
2016-11-02 15:12
ok

kostas
2016-11-02 15:13
You want a diagram, or a write-up?

kostas
2016-11-02 15:13
(Bracing myself for a "both" answer.)

jyellick
2016-11-02 15:13
For a little context, the initial problem was that if authorization for creating a chain is stored on one chain, and the actual creation of that chain is stored on another, then the two are not necessarily serialized, and chain creation would be non-deterministic.

muralisr
2016-11-02 15:16
@kostas , either would be a good start (whichever is easier)…

kostas
2016-11-02 15:16
Write-up it is then, OK.

elli
2016-11-02 15:37
Hi, we are working on a writeup on this :slightly_smiling_face:

kostas
2016-11-02 15:38
The revised flow that eliminates the need for cross-posting?

elli
2016-11-02 15:38
Now getting Binh's comments, but should be done soon

elli
2016-11-02 16:46
Adding @binhn here :slightly_smiling_face:

kostas
2016-11-02 16:54
Chatted with Binh earlier and it seems that in his multi-channel doc he's already working under the assumption of a system chain that is only exposed to the orderers. So we're all on the same page.

binhn
2016-11-02 18:58
ok, finally caught up with this discussion. Yesterday I started to rewrite part of the multichannel doc and posted a comment about the system chain that all peers are on. Basically the conclusion was that on the peer side there is no system chain, just the chains that the app creates.

hiepnm
2016-11-03 07:41
has joined #fabric-consensus-dev

yacovm
2016-11-03 13:32
Hi

yacovm
2016-11-03 13:32
what in the block is signed? the header, or just the body?

yacovm
2016-11-03 13:33
and where is the multi-orderer signature located in?

kostas
2016-11-03 13:42
@yacovm: Have a look at the revised proto file here: https://gerrit.hyperledger.org/r/#/c/2153/2/orderer/atomicbroadcast/ab.proto

kostas
2016-11-03 13:43
The signature you asked for yesterday would go in the BlockMetadata.

kostas
2016-11-03 13:45
In the BlockData you'll find a series of marshaled Envelopes (https://gerrit.hyperledger.org/r/#/c/2153/2/orderer/atomicbroadcast/message.proto) which are signed (by the submitter).

yacovm
2016-11-03 13:47
the block data and metadata are just lists of byte arrays

kostas
2016-11-03 13:48
Yes.

yacovm
2016-11-03 13:48
so how do you interpret stuff?

yacovm
2016-11-03 13:48
you go over the byte arrays and check what each one is?

jyellick
2016-11-03 13:49
@yacovm For the `Data` they will all be `Envelope`s marshaled

yacovm
2016-11-03 13:49
I'm asking- where does the multi-sig reside

yacovm
2016-11-03 13:49
can you point me?

jyellick
2016-11-03 13:49
Does not exist yet

kostas
2016-11-03 13:49
Have a look at an example here: https://gerrit.hyperledger.org/r/#/c/2177/

yacovm
2016-11-03 13:49
oh....

jyellick
2016-11-03 13:50
However, I strongly suspect, this will simply be a series of `Envelope` messages, which contains a `Payload.data` of the hash of the block header

jyellick
2016-11-03 13:50
And we will define a new `Header.Type` of `BlockSignature`

yacovm
2016-11-03 13:50
I might be asking a foolish question, but- is the header going to be signed? yes or no? (please say yes) because- the seqNum of the block is only in the header

jyellick
2016-11-03 13:51
I expect the signatures to be over the (hash of) the block header

jyellick
2016-11-03 13:51
(So, yes)

yacovm
2016-11-03 13:52
the entire header, right?

jyellick
2016-11-03 13:52
Correct

kostas
2016-11-03 13:52
The hash of the header, so yes.

yacovm
2016-11-03 13:52
ok, just making sure

yacovm
2016-11-03 13:53
but I still don't understand how the metadata extraction works

jyellick
2016-11-03 13:53
Assume for now, that all metadata is of type `Envelope`

yacovm
2016-11-03 13:53
you go over the metadata byte array after byte array

yacovm
2016-11-03 13:53
oh ok

yacovm
2016-11-03 13:53
so it's bytes in the proto file just for convenience

jyellick
2016-11-03 13:53
Correct, in order to build the data hash, the Data really needs to be as bytes

jyellick
2016-11-03 13:53
So, for symmetry, the MetaData is as well
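
The byte-array layout being discussed might look roughly like the sketch below. This is a hedged illustration based on the conversation, not the final `ab.proto` definition; the message names, field names, and field numbers are assumptions.

```proto
// Illustrative sketch only; the real definitions live in ab.proto.
message Block {
    BlockHeader Header = 1;
    BlockData Data = 2;
    BlockMetadata Metadata = 3;
}

message BlockData {
    // Each entry is assumed to be a marshaled Envelope, signed by the submitter.
    repeated bytes Data = 1;
}

message BlockMetadata {
    // Kept as raw bytes for symmetry with Data; each entry is assumed to be
    // a marshaled Envelope (e.g. a future BlockSignature header type).
    repeated bytes Metadata = 1;
}
```

Keeping both sections as `repeated bytes` means the data hash can be computed over the exact bytes received, with no re-marshaling step in between.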

yacovm
2016-11-03 13:54
to avoid marshalling it and demarshalling just for the hash?

jyellick
2016-11-03 13:54
Also to avoid custom marshaling schemes, because proto marshaling is not deterministic

yacovm
2016-11-03 13:55
say what?

jyellick
2016-11-03 13:55
We will have to do custom marshaling for the block header which is unfortunate, but it is small and simple

yacovm
2016-11-03 13:55
if I have a .proto definition you're saying the marshalling isn't deterministic?

jyellick
2016-11-03 13:55
I am

jyellick
2016-11-03 13:56
It is, in implementation, always deterministic

kostas
2016-11-03 13:56
(Here we go again.)

jyellick
2016-11-03 13:56
But in documentation, it is explicitly stated that marshaling is not required to be deterministic

yacovm
2016-11-03 13:56
umm ok
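
The custom marshaling jyellick mentions for the block header could look something like this minimal Go sketch. The field names are assumptions; the point is a fixed, hand-rolled byte layout so the header hash is reproducible, since generic proto marshaling is not documented to be deterministic.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// BlockHeader is an illustrative stand-in for the real header type.
type BlockHeader struct {
	Number       uint64
	PreviousHash []byte
	DataHash     []byte
}

// Bytes encodes the header with a fixed, hand-written layout:
// an 8-byte big-endian number, then length-prefixed hash fields.
// The same header always yields the same bytes.
func (h *BlockHeader) Bytes() []byte {
	buf := make([]byte, 8)
	binary.BigEndian.PutUint64(buf, h.Number)
	for _, field := range [][]byte{h.PreviousHash, h.DataHash} {
		l := make([]byte, 8)
		binary.BigEndian.PutUint64(l, uint64(len(field)))
		buf = append(buf, l...)
		buf = append(buf, field...)
	}
	return buf
}

// Hash is the value that block signatures would be computed over.
func (h *BlockHeader) Hash() [32]byte {
	return sha256.Sum256(h.Bytes())
}

func main() {
	h := BlockHeader{Number: 7, PreviousHash: []byte("prev"), DataHash: []byte("data")}
	fmt.Printf("%x\n", h.Hash())
}
```

Because the layout is fixed by hand, two replicas hashing the same header always agree, which is what signing over the header hash requires.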

tuand
2016-11-03 13:59
scrum ...

2016-11-03 13:59
@tuand has started a Google+ Hangout for this channel. https://hangouts.google.com/hangouts/_/ch3ckch4p5bnznvt42qe7fwrdqe.

tuand
2016-11-03 14:01
@simon @hgabor

hgabor
2016-11-03 14:01
having another meeting in parallel, sorry

hgabor
2016-11-03 14:02
today, I reviewed @simon 's SBFT patchsets (we need one more reviewer) and groomed my own ones (reviewers needed and a working Jenkins needed)

hgabor
2016-11-03 14:04
we would need somebody to do https://jira.hyperledger.org/browse/FAB-477 (I do not know those thresholds) Any volunteers / victims?

hgabor
2016-11-03 15:10
noone?

jyellick
2016-11-03 15:38
@hgabor It's something I'd be willing to look at eventually, but sbft is lower priority than channels and chain ACLs etc. at the moment, so I don't have the cycles for it

hgabor
2016-11-03 15:38
it is just 1-2 hours for an experienced reviewer like you :smile:

jyellick
2016-11-03 15:50
Haha, if only that were true, oh the free time I would have

hgabor
2016-11-03 16:19
btw @simon has dozens of commits here waiting for review (and reverify): https://gerrit.hyperledger.org/r/#/c/2117/

jyellick
2016-11-03 16:20
I have looked at all of them I have been tagged on

jyellick
2016-11-03 16:20
Waiting for verify on some

jyellick
2016-11-03 16:20
If you spot any that you think I should review, feel free to add me to the reviewer list

vukolic
2016-11-03 18:37
I'll try to help bu not really this week (@OSDI)

keithsmith
2016-11-03 19:15
I believe each config update transaction will contain the entire config, right? Can someone tell me how to read the latest config update in the peer?

jyellick
2016-11-03 19:30
@keithsmith Yes, every configuration transaction contains the full set of config.

jyellick
2016-11-03 19:31
I'm not sure what you mean by: > Can someone tell me how to read the latest config update in the peer?

keithsmith
2016-11-03 19:40
@jyellick When in the peer, how to read the config data from the system ledger?

kostas
2016-11-03 19:41
Scan all the blocks in reverse order until you find one that matches something similar to what you see here? https://gerrit.hyperledger.org/r/#/c/2179/4/orderer/common/bootstrap/static/static_test.go
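
The reverse scan kostas describes can be sketched as follows. The types here are hypothetical stand-ins; real code would match each block's contents against the configuration transaction format shown in the linked test.

```go
package main

import "fmt"

// Block is a hypothetical stand-in for a ledger block; IsConfig marks
// a block whose envelope carries a configuration transaction.
type Block struct {
	Number   uint64
	IsConfig bool
}

// latestConfig walks the chain from the newest block backwards and
// returns the most recent configuration block, or nil if none exists.
func latestConfig(chain []Block) *Block {
	for i := len(chain) - 1; i >= 0; i-- {
		if chain[i].IsConfig {
			return &chain[i]
		}
	}
	return nil
}

func main() {
	chain := []Block{
		{Number: 0, IsConfig: true}, // genesis carries the initial config
		{Number: 1},
		{Number: 2, IsConfig: true}, // a later reconfiguration
		{Number: 3},
	}
	if b := latestConfig(chain); b != nil {
		fmt.Println("latest config at block", b.Number)
	}
}
```

Since the genesis block always carries the initial config, the scan is guaranteed to terminate with a hit by block 0.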

jyellick
2016-11-03 19:45
@keithsmith Eventually, there will be a system chaincode at the peer which encodes this information into the normal ledger

jyellick
2016-11-03 19:46
You can use standard query mechanisms at that point.

jyellick
2016-11-03 19:46
If you need to find config from the raw chain before this system chaincode translation, then you can do as @kostas suggests

keithsmith
2016-11-03 19:50
@jyellick Is there a jira item I can follow for the system chaincode to do this?

keithsmith
2016-11-03 19:51
any ETA on that?

jyellick
2016-11-03 19:52
None that I am aware of, this was just hashed out last Thursday I think. It will happen some time after the transaction format is consolidated. This should probably be created in JIRA, though I'm not entirely sure what category to assign it.

jyellick
2016-11-03 19:52
(Or, it exists and I'm unaware of it)

kostas
2016-11-04 14:04
So, when a client issues a Broadcast RPC, I would expect that all envelopes that are getting pushed to that RPC belong to the same channel/`ChainID`. This means that the first envelope to be sent over that session effectively sets the channel for that session. Then, in the case of receiving an envelope that corresponds to a different channel, I want to simply send back a `BAD_REQUEST` response and ignore that message. The other approach, that @jyellick favors, would be to not ignore that message but route it to the proper channel instead. What do we think is the right way to go? @bcbrock and @garisingh may have thoughts on this.

kostas
2016-11-04 14:05
Any potential performance issues aside, I like the idea of asking/expecting the client to issue a different RPC per channel.

simon
2016-11-04 14:08
separate RPC streams for separate channels

simon
2016-11-04 14:08
define channel ID on RPC

kostas
2016-11-04 14:09
That makes more sense to me as well.

simon
2016-11-04 14:09
drop RPC connection when you detect an invalid message

kostas
2016-11-04 14:09
Drop or `BAD_REQUEST` and move on?

kostas
2016-11-04 14:09
The latter is what I do now.

simon
2016-11-04 14:09
return BAD_REQUEST when closing the RPC stream?

kostas
2016-11-04 14:10
Return BAD_REQUEST and keep serving the stream.

kostas
2016-11-04 14:10
Later on, I want to have a threshold that says:

kostas
2016-11-04 14:10
If you issue more than X BAD_REQUESTS during that session, I'm going to close the stream.

kostas
2016-11-04 14:10
But I don't want to get too clever about it.

simon
2016-11-04 14:10
what?

simon
2016-11-04 14:11
that is insane

simon
2016-11-04 14:11
one bad request, drop it

kostas
2016-11-04 14:11
What exactly makes it "insane"?

simon
2016-11-04 14:11
it's way complicated

kostas
2016-11-04 14:12
That I cannot argue against, which is why I said I don't want to get too clever about it. One BAD_REQUEST and the stream drops, it is.

simon
2016-11-04 14:12
i wouldn't even send a response

kostas
2016-11-04 14:12
Heh.

simon
2016-11-04 14:13
i'd close the stream with BAD_REQUEST

simon
2016-11-04 14:13
unless that's not an option to have a single value on rpc close

simon
2016-11-04 14:13
all of this project is way complicated

garisingh
2016-11-04 14:19
well then shouldn't we make Broadcast a unary service rather than a streaming service?

simon
2016-11-04 14:20
i think the idea was to provide a stream of acks

simon
2016-11-04 14:20
garisingh: or do you mean outgoing?

simon
2016-11-04 14:20
i did benchmarks, and RPC setup overhead is significant

simon
2016-11-04 14:21
a factor of 10, as I remember?

jyellick
2016-11-04 14:21
This is one of the shortcomings of the proto service definitions to me. We cannot define 'initial stream parameters', like "This broadcast is for channel X"; they must be implicitly set after the fact. So from an API usage perspective, it's non-obvious that you cannot mix transactions destined for different chains. My proposal was to say a client _should_ open a stream per channel for broadcast, but as there's no way to prevent the client from simply connecting, sending one, disconnecting, and repeating, there was no real advantage to disallowing mixed chain IDs.

simon
2016-11-04 14:22
so you can tell from the message what channel it goes to?

jyellick
2016-11-04 14:23
Yes

kostas
2016-11-04 14:23
Yes.

simon
2016-11-04 14:23
well then why do you need separate channels at all?

jyellick
2016-11-04 14:23
> well then why do you need separate channels at all?
I don't follow

simon
2016-11-04 14:23
> So, when a client issues a Broadcast RPC, I would expect that all envelopes that are getting pushed to that RPC belong to the same channel/`ChainID`.

jyellick
2016-11-04 14:25
Every request has a chainID embedded in it. The question is, do we implicitly 'lock' a stream to the first chainID we observe and error on subsequent requests if that chainID changes, or do we tolerate it and route it correctly. ~I think from a Kafka perspective, routing involves building a producer for the channel, and since the client is not disconnecting, it's nonobvious whether to destroy the old producer.~
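
The "lock the stream to the first chain ID" option under discussion can be reduced to a small pure function for illustration. The status names here are placeholders, not the actual `ab.proto` values.

```go
package main

import "fmt"

type Status int

const (
	SUCCESS Status = iota
	BAD_REQUEST
)

// handleStream processes the chain IDs seen on one Broadcast stream.
// The first envelope implicitly sets the stream's channel; any envelope
// for a different channel gets BAD_REQUEST and the stream is dropped.
func handleStream(chainIDs []string) []Status {
	var statuses []Status
	locked := ""
	for _, id := range chainIDs {
		if locked == "" {
			locked = id // first envelope locks the stream to its channel
		}
		if id != locked {
			statuses = append(statuses, BAD_REQUEST)
			break // one bad request: close the stream
		}
		statuses = append(statuses, SUCCESS)
	}
	return statuses
}

func main() {
	fmt.Println(handleStream([]string{"foo", "foo", "bar", "foo"}))
}
```

The routing alternative would replace the `break` with a per-chain dispatch, since each envelope carries its chainID in the header anyway.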

kostas
2016-11-04 14:26
Not really. I was thinking that from a performance perspective, mixing and matching doesn't really hurt us.

jyellick
2016-11-04 14:26
Oh, my mistake. I assumed it was a Kafka resource allocation/destruction problem.

kostas
2016-11-04 14:27
And you check every envelope for the chainID in its header anyway, so routing it to the right channel (for block creation) is not really a problem.

kostas
2016-11-04 14:27
(At least the way I'm writing it now.)

garisingh
2016-11-04 14:27
so seems there is no reason not to support multiple channels over the same session?

simon
2016-11-04 14:28
so while everybody is here

simon
2016-11-04 14:28
i have only a few days left before i leave

garisingh
2016-11-04 14:28
:disappointed:

simon
2016-11-04 14:28
i think it would make sense to get sbft in and have some people get familiar with it

simon
2016-11-04 14:29
also once gabor's sbft application is in, testers should tear it apart

simon
2016-11-04 14:30
but it takes forever to get any patch in

simon
2016-11-04 14:30
it's ridiculous

garisingh
2016-11-04 14:30
I don't have any issues with trying to get it in. We are doing the same thing with the couchdb stuff on the ledger - may not make v1.0 cut, but still being merged in

simon
2016-11-04 14:30
the core of that sbft application is half a year old

simon
2016-11-04 14:31
block chain without bft...

simon
2016-11-04 14:31
why not just use database replication then

simon
2016-11-04 14:36
yea somebody merged some commits in the last minutes

simon
2016-11-04 14:38
ah chris ferris is merging

jyellick
2016-11-04 14:39
If you have a particular changeset that is having problems, you can always post to #fabric-pr-review which seems to push review quickly

simon
2016-11-04 14:40
well i guess gerrit is working out exactly as i predicted it would work out

simon
2016-11-04 14:43
the idea of reviews is that people familiar with the code review...

garisingh
2016-11-04 14:57
that's honestly why I have not reviewed the sbft code other than looking through it. I am no expert in that area. In other areas I can be helpful

simon
2016-11-04 14:58
well now it is in

simon
2016-11-04 14:58
and hopefully we can merge the sbft app as well

jyellick
2016-11-04 15:00
I +2-ed because at least once it's in, we can more easily test/hack, but to Chris's point, there is almost no test attached to it

simon
2016-11-04 15:03
because it is glue logic

simon
2016-11-04 15:03
you test it by running the app

jyellick
2016-11-04 15:05
Right, could hook it into the behave tests though

jyellick
2016-11-04 15:05
(Since they already exercise the ab api)

simon
2016-11-04 15:07
for example

simon
2016-11-04 15:07
tho i think behave is way sluggish

simon
2016-11-04 15:08
there must be a reason that the test team used go for scripting their scenarios

garisingh
2016-11-04 15:19
some of the new BDD tests are written in Go now (but using the behave meta language - at least I think that's the case)

garisingh
2016-11-04 15:19
but 100% agree - there were plenty of Go BDD / test frameworks available

simon
2016-11-04 15:20
i call BS on BDD

simon
2016-11-04 15:20
it's what was called integration test before

garisingh
2016-11-04 15:20
:slightly_smiling_face:

garisingh
2016-11-04 15:20
I agree

cbf
2016-11-04 15:22
@simon I disagree - everything should have unit tests, period

cbf
2016-11-04 15:22
we should not be relying on integration/bdd tests to know whether the code is broken or not

simon
2016-11-04 15:22
but we can rely on unit tests?

simon
2016-11-04 15:22
not everything is a nail

cbf
2016-11-04 15:23
note also that I was only submitting the patches that had already been +2'ed and where there was no comment as to why they hadn't been merged…

simon
2016-11-04 15:23
different code needs different ways of testing

simon
2016-11-04 15:23
i appreciate the merging

cbf
2016-11-04 15:23
of course, but this project has a lot to learn about testing… our code coverage is just over 50% which for a mature open source project is abysmal

simon
2016-11-04 15:24
well good that this is not a mature open source project

cbf
2016-11-04 15:24
we are also dealing with largely broken bdd tests which are not contributing to validation, so again, major fail

simon
2016-11-04 15:24
yea

simon
2016-11-04 15:25
also the bdd tests took so long to run for me that it was impossible to use them in the development loop

cbf
2016-11-04 15:25
all the more reason to have unit tests that at least help developers know when they are breaking stuff - immediately if they are developing with a watch on the code

simon
2016-11-04 15:25
i totally agree

simon
2016-11-04 15:25
but not every piece of code benefits equally much from a unit test

simon
2016-11-04 15:26
sometimes an integration test works better

cbf
2016-11-04 15:26
we also accumulate significant technical debt through the lack of tests as the code drifts, yet we have no idea what we are breaking until we hit an edge case that isn't covered by the happy-path testing that seems to be all we currently have

simon
2016-11-04 15:27
the sbft code barely has any unit tests

cbf
2016-11-04 15:27
it catches stupid errors faster and again, much of our integration testing is not testing edge cases effectively

simon
2016-11-04 15:27
essentially all tests are integration tests

simon
2016-11-04 15:27
because a distributed system implementation just can't be tested thoroughly with unit tests

cbf
2016-11-04 15:27
only - I agree

cbf
2016-11-04 15:27
I am not asking or saying only

cbf
2016-11-04 15:27
I am saying that all code should have unit tests

simon
2016-11-04 15:28
yes, that's where i disagree

cbf
2016-11-04 15:29
this is not a closed source project - it is open source… code will come from many sources - in fact, we want that

cbf
2016-11-04 15:29
and when developers see code without unit tests, it gives them little confidence and often they don’t engage

simon
2016-11-04 15:30
ah?

simon
2016-11-04 15:30
do you have any links that support that claim?

cbf
2016-11-04 15:30
look at any other mature open source project - they will have test coverage in the high 80 or low 90s

cbf
2016-11-04 15:30
this is human nature

simon
2016-11-04 15:30
but this is not a mature project!

cbf
2016-11-04 15:30
that is evident

simon
2016-11-04 15:30
the design isn't even finalized

simon
2016-11-04 15:34
well, the sbft code has almost no unit tests, but has almost complete test coverage save for error returns/panics

simon
2016-11-04 15:34
all done via integration tests

simon
2016-11-04 15:35
do you claim it still needs unit tests?

donovanhide
2016-11-04 15:40
http://jepsen.io/analyses.html Maybe ask this chap :slightly_smiling_face:

simon
2016-11-04 15:41
maybe first build the system until we're confident that it works flawlessly, then get our dreams crushed by jepsen

donovanhide
2016-11-04 15:42
crushed _in public_

simon
2016-11-04 15:42
same thing

simon
2016-11-04 15:47
well, i'm out for the next week

simon
2016-11-04 15:48
good that we got the sbft stuff in

yacovm
2016-11-04 15:57
That's the aphyer guy?

kostas
2016-11-04 15:58
Yes.

yacovm
2016-11-04 15:58
He had a really funny profile pic on github

yacovm
2016-11-04 15:59
Why can't we have integration tests in unit tests?

yacovm
2016-11-04 15:59
In gossip i have tests that spawn 20 nodes and kill some of them, check they all replicate information, etc.

yacovm
2016-11-04 16:00
If it was easy to configure fabric peers in pure Go, that could have been done

simon
2016-11-04 16:02
yea but that's not a unit test

simon
2016-11-04 16:02
that's an integration test

yacovm
2016-11-04 16:03
Yeah i mean- that they will run as a unit test

simon
2016-11-04 16:03
well they run in go test, but i wouldn't call it a unit test

donovanhide
2016-11-04 16:04
Go’s http has an interesting package that makes testing easier: https://golang.org/pkg/net/http/httptest/ Seems like a good model to emulate?

james
2016-11-04 16:15
has joined #fabric-consensus-dev

hgabor
2016-11-04 17:28
For the glue code in sbft app, I will try to add tests next week. Hope I will have enough time to finish them

jonathanlevi
2016-11-04 18:50
has joined #fabric-consensus-dev

jonathanlevi
2016-11-04 19:51
Just to drop a line here: I am of the opinion that in the long run, tests accelerate development time/time to market. They build confidence in existing code/modules and help with updating/extending/refactoring and re-using…

jonathanlevi
2016-11-04 19:52
Even if the code is not “mature” or in “production” yet… or some would say: especially when the code is not “mature” or in “production” it should be tested as much as possible.

jonathanlevi
2016-11-04 19:52
Unit-tests are really great. Especially during/for development/developers.

jonathanlevi
2016-11-04 19:53
OK, OK, 2 lines (if you have like a 65” monitor probably :wink:. Not true for my laptop).

garisingh
2016-11-05 10:06
https://gerrit.hyperledger.org/r/#/c/2039/6 - want to get one of your opinions. I like the fact that reuse is being attempted here, but want to make sure folks are okay with this versus moving some functionality into the common package

jyellick
2016-11-06 05:02
Commented on Gerrit

frankyclu
2016-11-06 05:58
this should be a fairly easy fix, but probably worth doing sooner, since txs will get dropped without the caller getting notified @jyellick @tuand


oiakovlev
2016-11-07 10:47
I have a question about chaincode determinism. Normally, to make chaincode deterministic, we should not use the date-time API, right? But I believe this is a fairly common use case: accounts can have an expiration time, and according to the user requirements a transfer should not be made to an inactive account (accounts are also stored in the KVS), and the check is part of the chaincode contract... So the question is: how do we avoid non-determinism with this approach, given that the check would involve a date-time API call? Any thoughts?

frankyclu
2016-11-07 10:55
you can use txs to put current time in your world state, probably triggered by a separate clock service that runs periodically. this is valid since in many countries you have a central trusted time service that financial systems retrieve date/time from

frankyclu
2016-11-07 10:57
speaking of determinism, I think there is an interesting case: if you have multiple putStates called in a different order by different chaincodes, even if they are guaranteed to be the same on the db, the hash calculated from the delta would still be different @sheehan

oiakovlev
2016-11-07 11:03
How I understand determinism is that if I replay a tx from the same state, I should get the same result; this seems to be the idea of blockchain. If I have a date-time check in my chaincode, this might not hold. As for the second sentence you wrote, I'm not sure I got it. If it is true, it sounds like a bug, since transactions are ordered and should give the same result on each peer if their code is deterministic...

frankyclu
2016-11-07 11:17
for the first sentence: you use a time provider service to write the time to your blockchain; this is no different from any other invoke transaction, which will get your time synchronized to all peers. Other peers will use the time stored in the world state of your chaincode rather than calling any of the date/time functions

oiakovlev
2016-11-07 11:37
so an external provider gives me the same date-time on all the peers, which is not something I really have doubts about; I can imagine that all the hosts running peers are in sync and have the same system time. The doubt I have is that when I replay the same transactions in, say, a few months, I'll get different results, since the invocation has no (and should not have to have) time in its parameters.

oiakovlev
2016-11-07 11:38
or maybe I didn't get the idea you propose..?

jyellick
2016-11-07 14:26
@oiakovlev The idea would be that you use the time that is on the chain, so that when you replay the chain, you are also replaying the time. And since all peers have the same chain, they have the same time.

yacovm
2016-11-07 14:29
hey Consensus guys. I'm starting to plan the gossip aspect of the multi channel stuff

yacovm
2016-11-07 14:29
remember that code that you said is going to be "librarized" (made into a library)

yacovm
2016-11-07 14:30
that allows me to find, for a given block, which organizations it can be sent to

yacovm
2016-11-07 14:30
and to know which organization a certain peer's PKI-id belongs to

yacovm
2016-11-07 14:30
?

oiakovlev
2016-11-07 14:31
@jyellick - thanks, so the idea is that my chaincode, on the first invocation, stores the time it was invoked at, and each following invocation reuses that time... that's what I was thinking about as well...

jyellick
2016-11-07 14:33
@oiakovlev In the case of an oracle-type service, which periodically writes the time to the chain, you could always reference the value exposed by that service in your chaincode.

jyellick
2016-11-07 14:33
@yacovm Yes, I know to what you are referring

yacovm
2016-11-07 14:34
so, any idea if that's written yet?

oiakovlev
2016-11-07 14:34
so why, while replaying the tx, would the oracle service write me the old time?

yacovm
2016-11-07 14:34
or in progress?

jyellick
2016-11-07 14:36
@oiakovlev The oracle-service writes it to the chain, so when replaying, you also replay the writes of the oracle service, so after you replay block 11, your 'simulated time' corresponds to the time encoded by the oracle onto the chain at block 11. Essentially, the key is not to depend on a system clock, but rather to depend on a 'chain clock'.

jyellick
2016-11-07 14:37
@yacovm I'd say it is in progress. I'll also point out that, per discussions with @binhn @muralisr, their preference was to write a translation layer which converts the configuration transaction into normal MVCC-type data so that it can be queried and utilized with all the normal peer tools; you might want to simply pull it from there once that translation layer is completed.

yacovm
2016-11-07 14:38
can you point me to the data type please? (where is it in the code)

oiakovlev
2016-11-07 14:38
@jyellick yes... makes sense... the only thing which seems unclear: at the point I replay, I don't have `block 11`, right? How does the oracle service know what to put there at the time of replay?

jyellick
2016-11-07 14:39
@yacovm The short answer is, there's some code in development for it, but nothing committed, and nothing stable enough I'd be comfortable suggesting you work off of.

yacovm
2016-11-07 14:40
ok, then how about the following course of action? I'll create a Go interface that can be implemented in the future and use that in the code, and I'll share it with the people who are supposed to implement it so they can acknowledge or object to the proposed API?

yacovm
2016-11-07 14:40
I just want to have something to hold on to, so I could progress

jyellick
2016-11-07 14:40
@oiakovlev Your chaincode never interacts with the oracle service. Your chaincode interacts with the data the oracle service put onto the chain. In that sense, when you replay, you are only utilizing information on the chain (which you do have), and do not need to interact with the oracle service. The oracle service is responsible for putting the data onto the chain, then it is done.
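
The "chain clock" pattern jyellick describes can be illustrated with a toy key-value map standing in for world state. Names like `oracle.now` are invented for this example; a real chaincode would use the shim's GetState/PutState.

```go
package main

import (
	"fmt"
	"strconv"
)

// state is a toy stand-in for chaincode world state.
type state map[string]string

// oracleTick is what the time-oracle service would do: write the
// current time onto the chain as an ordinary transaction.
func oracleTick(s state, unixTime int64) {
	s["oracle.now"] = strconv.FormatInt(unixTime, 10)
}

// accountActive checks an expiry against the chain clock, never the
// system clock, so replaying the same transactions gives the same answer.
func accountActive(s state, expiry int64) (bool, error) {
	raw, ok := s["oracle.now"]
	if !ok {
		return false, fmt.Errorf("no chain time recorded yet")
	}
	now, err := strconv.ParseInt(raw, 10, 64)
	if err != nil {
		return false, err
	}
	return now < expiry, nil
}

func main() {
	s := state{}
	oracleTick(s, 1000000) // the oracle transaction lands on the chain first
	active, _ := accountActive(s, 2000000)
	fmt.Println("active:", active)
}
```

On replay, `oracle.now` is rebuilt from the chain's own transactions, so the expiry check is deterministic regardless of when the replay happens.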

frankyclu
2016-11-07 14:40
just some caution on MVCC: it is likely to be an unwanted feature for financial use cases, where strong consistency is almost always a prerequisite and you can't have txs fail later without the caller being notified

jyellick
2016-11-07 14:41
@yacovm Sounds like a good plan

yacovm
2016-11-07 14:41
can you point me at the people I need to share it with? :slightly_smiling_face:

jyellick
2016-11-07 14:42
@tuand is the one working on populating the genesis block, but @keithsmith and @muralisr might also have some interest. I'd also like to be copied.

tuand
2016-11-07 14:56
scrum ...

jyellick
2016-11-07 14:59
link?


tuand
2016-11-07 15:15
folks who didn't make the scrum please post your 1 or 2 liner summary here

lhaskins
2016-11-07 15:25
We are still running tests for v0.6 as we get new loads (new load received Fri afternoon; possibly another very soon), and we are getting started on tests for v1.0. For more details see https://jira.hyperledger.org/browse/FAB-531 and https://jira.hyperledger.org/browse/FAB-979

yacovm
2016-11-07 17:37
OK, I defined an API that needs to be implemented by the upper layers that use the gossip (basically, the peer, ledger layer): https://gerrit.hyperledger.org/r/#/c/2325/ I'd like you guys { @tuand , @keithsmith , @muralisr , @c0rwin } to take a look and tell me what you think

markparz
2016-11-07 17:38
bunch of new videos posted on #playbacks and our youtube channel at https://www.youtube.com/channel/UCCFdgCWH_1vCndMPVqQlwZw Check them out and subscribe… including tool to generate the genesis block

tuand
2016-11-07 21:04
@yacov, can you add the people on your list as reviewers to 2325 so gerrit can do notifications ? @binhn also ...

yacovm
2016-11-07 21:07
I'll assume you meant to tag me, and yes

yacovm
2016-11-07 21:07
in a minute, I'll push a modified change set, @jyellick you win this time :slightly_smiling_face:

yacovm
2016-11-07 21:24
added

garisingh
2016-11-07 23:47
@jyellick - your changes are merged

echenrunner
2016-11-08 01:21
"[FAB-707] disconnected Peer can't recover from lost connection, then start sending view." I took a look at the code in pbft-core.go, and it seems that in `stateUpdatedEvent`, the case where update.seqNo is far less than instance.seqNo can cause a loop issue. That is because instance.Checkpoint(update.seqNo, update.id) will store values so low (vc.H) that the view change goes into a loop and nothing gets updated. I played around with it a bit

frankyclu
2016-11-08 05:38
@echenrunner the issue has already been concluded to be a design "intent/gap", not a code issue, if you read the comment section; hopefully the suspect message will be implemented in sbft by @kostas to permanently solve the problem

hgabor
2016-11-08 08:02
@cbf and everybody, about https://gerrit.hyperledger.org/r/#/c/2037/ and unit testing: I am creating tests for connection.go, persist.go, and crypto.go. main.go is tested in the next commit. 1218 lines are already covered with tests (see the backend test and the next commit; the rest are config files, structs, protos). However, I won't write tests for functions which are just calling grpc.Dial.

echenrunner
2016-11-08 11:08
The issue is when the server comes up from a "stop". The logs of all 4 peers show "H" staying within the seqNo from before it came down. It's when it comes back up again that update.seqNo/instance.seqNo causes the loop. The issue might be which value to use: the code has instance.lastExec = update.seqNo from the checkpoint, but not instance.seqNo.

jyellick
2016-11-08 13:56
@echenrunner With respect to FAB-707, this is expected behavior. The `stateUpdatedEvent` does deliberately loop when state transfer comes back with a state update which is too far in the past. I think you will find that `instance.seqNo` is only used when the peer is the new PBFT primary, where it is appropriately set after a view change, so it is not necessary to set it after state transfer.
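The retry behavior being discussed can be sketched roughly like this (names and shape are mine, not the actual pbft-core.go code, which is event-driven rather than a plain loop):

```go
package main

import "fmt"

// retryStateTransfer sketches the deliberate retry described above: a state
// update whose sequence number is at or below the low watermark h is too far
// in the past, so it is discarded and state transfer is requested again;
// a usable update is adopted as the new last-executed sequence number.
func retryStateTransfer(h uint64, updates []uint64) (applied uint64, retries int) {
	for _, seqNo := range updates {
		if seqNo <= h {
			retries++ // too far in the past: ignore and re-request
			continue
		}
		applied = seqNo // usable update: adopt as lastExec
	}
	return applied, retries
}

func main() {
	// Two stale updates (50, 80) are rejected before a usable one (120) arrives.
	applied, retries := retryStateTransfer(100, []uint64{50, 80, 120})
	fmt.Println(applied, retries)
}
```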


yacovm
2016-11-08 16:07
@jyellick so, back to the multi-channel support API design, since aso said the primitives I require will be there, can we keep the PKI-id in the method signature? That *really* simplifies things to me

jyellick
2016-11-08 16:14
What about instead a ``` CertByPKIid(pki PKIidType) identity []byte ``` provided by the crypto folks, and then an ``` Authorized(chainID []byte, identity []byte) ``` provided by the orderer-originated lib
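The two proposed calls might compose like this; the backing data, the `string`-based `PKIidType`, and the ACL lookup are all illustrative assumptions, only the two function names come from the proposal above:

```go
package main

import (
	"bytes"
	"fmt"
)

// PKIidType is a stand-in for the gossip layer's PKI-id type.
type PKIidType string

// Toy backing stores for the sketch.
var certStore = map[PKIidType][]byte{"peer0": []byte("cert-peer0")}
var chainACL = map[string][][]byte{"chainA": {[]byte("cert-peer0")}}

// CertByPKIid would be provided by the crypto folks: resolve a PKI-id
// to the corresponding identity (certificate) bytes.
func CertByPKIid(pki PKIidType) []byte { return certStore[pki] }

// Authorized would be provided by the orderer-originated lib: check
// whether an identity may act on the given chain.
func Authorized(chainID string, identity []byte) bool {
	for _, id := range chainACL[chainID] {
		if bytes.Equal(id, identity) {
			return true
		}
	}
	return false
}

func main() {
	id := CertByPKIid("peer0")
	fmt.Println(Authorized("chainA", id))
}
```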

yacovm
2016-11-08 16:47
was on the phone with adc and aso. By identity do you mean cert?

yacovm
2016-11-08 16:47
if yes then sure

echenrunner
2016-11-08 18:20
With respect to FAB-707: this issue might not be related to PBFT. Two quick changes to pbft-core.go seem to correct the looping, and it is able to continue creating blocks. My concern is: if this were in production and one of the VPs were in this loop, how do we correct the issue before the second or third one decides to bring their peer down because of DR testing or a weekend outage?

jyellick
2016-11-08 18:22
@echenrunner The design of PBFT is such that it tolerates up to f peers being out of sync. If another peer were to fail (here where f=1, n=4), then the network would resync and the failed 'looping' peer would pick back up and function normally.

cca
2016-11-09 08:03
@echenrunner - You have to make up your mind about the system assumptions. The protocol as implemented will halt when a second peer is down, because it is designed to tolerate only 1 failing node (f=1). If you want it to continue with two down (f=2), you need n=7=3f+1 nodes. Alternatively, one could reconfigure the group (through a transaction that is ordered) and eliminate a node which some trusted entity (or a majority of the nodes) confirms has failed and must be replaced. This is called "reconfiguration" and is available in systems like BFT-SMaRT (http://www.di.fc.ul.pt/~bessani/publications/dsn14-bftsmart.pdf). The design of this is known but not yet implemented here. I believe it is important to have such a reconfiguration method implemented.
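A quick sanity check of the arithmetic above, as a sketch (the function names are mine, not from the codebase):

```go
package main

import "fmt"

// bftN returns the minimum cluster size needed to tolerate f Byzantine
// faults under classic BFT assumptions: n = 3f + 1.
func bftN(f int) int { return 3*f + 1 }

// quorum returns the number of matching replies needed to make progress
// safely: 2f + 1.
func quorum(f int) int { return 2*f + 1 }

func main() {
	// f=1 gives the n=4 network discussed above; f=2 requires n=7.
	for f := 1; f <= 3; f++ {
		fmt.Printf("f=%d -> n=%d, quorum=%d\n", f, bftN(f), quorum(f))
	}
}
```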


yacovm
2016-11-09 12:28
is `ab.proto` changed to accommodate multi channel support? (it doesn't seem so to me, but I want to check) If not- can someone please describe what will be the changes? is another item going to be added to `SeekInfo`?

garisingh
2016-11-09 12:50
should be a `ChainID` parameter in there now?

yacovm
2016-11-09 12:52
oh, didn't see that, thanks :slightly_smiling_face:

yacovm
2016-11-09 12:52
and where are the credentials of the organization?

yacovm
2016-11-09 12:54
I thought the orderer needs to be sure that the peer is in the right organization to give out blocks from the chain

c0rwin
2016-11-09 14:15
Now you submit the `Envelope` to the orderers: ```// Envelope wraps a Payload with a signature so that the message may be authenticated message Envelope { // A marshaled Payload bytes payload = 1; // A signature by the creator specified in the Payload header bytes signature = 2; }```

yacovm
2016-11-09 14:17
the signature on the envelope is the signature of whom? the peer that calls deliver?

c0rwin
2016-11-09 14:18
I guess so. @jyellick, @muralisr Can you please confirm?

jyellick
2016-11-09 14:20
The signature is from the `creator` specified in the `Payload.header`

jyellick
2016-11-09 14:20
And this should be from the client who calls `Broadcast` (not `Deliver`)

yacovm
2016-11-09 14:21
but I asked about `Deliver`

yacovm
2016-11-09 14:21
there is no `envelope` there

jyellick
2016-11-09 14:22
> the signature on the envelope is the signature of whom? the peer that calls deliver? But you asked about the Envelope signature?

yacovm
2016-11-09 14:23
true, true

yacovm
2016-11-09 14:23
I was confused

yacovm
2016-11-09 14:23
can you explain then how the credentials are passed in Deliver?

jyellick
2016-11-09 14:27
The authentication for `Deliver` is done at the TLS handshake

yacovm
2016-11-09 14:28
isn't that a problem? I thought that the TLS cert is not necessarily related to the COP cert, so how do you associate the invocation with the peer's org?

jyellick
2016-11-09 14:30
My working assumption has been that TLS certs will have a chain of trust to a peer org CA. I'm not sure why this would not be the case.

garisingh
2016-11-09 14:45
@jyellick - so you are assuming mutual TLS (related to our discussion yesterday) for communication with the orderer nodes?

jyellick
2016-11-09 14:45
Correct

kostas
2016-11-09 14:47
How else would it work @garisingh?

kostas
2016-11-09 14:48
s/would/could

garisingh
2016-11-09 14:48
signature on the DeliverUpdate message

garisingh
2016-11-09 14:49
here's the thing - in the prior architecture, identity is always extracted from the signing certificate

garisingh
2016-11-09 14:49
I am not opposed to mutual TLS

garisingh
2016-11-09 14:49
but we have never done mutual TLS to date

jyellick
2016-11-09 14:49
Without mutual TLS auth, it introduces us to some other unpleasant problems, like unauthorized clients calling the RPC then going silent. The auth at the upper layer simplifies things a bit, I think.

garisingh
2016-11-09 14:49
anywhere in this project

garisingh
2016-11-09 14:49
I tend to agree

garisingh
2016-11-09 14:50
but you'll still need to do the authorization check above the TLS layer

garisingh
2016-11-09 14:50
TLS will just filter out untrusted connections

kostas
2016-11-09 14:51
So you're saying that it's unavoidable and we should also be asking for a signature on the DeliverUpdate messages? Why?

garisingh
2016-11-09 14:52
oh - sorry - I tend to agree with using mutual TLS

jyellick
2016-11-09 14:52
Certainly, for `Broadcast` and `Deliver` with channels we now have to make sure the call is authorized for the particular chainID

garisingh
2016-11-09 14:52
right

garisingh
2016-11-09 14:52
So - mutual TLS filters out untrusted connections - meaning you can't just hang around if you are not allowed to connect at all

yacovm
2016-11-09 14:53
How does the orderer learn of the certificate of each peer then?

jyellick
2016-11-09 14:53
On `Broadcast` we _could_ filter by both TLS cert and signing cert, or just signing cert, but we must verify the signing cert.

garisingh
2016-11-09 14:53
you'd still need to "propagate / forward" the client certificate to the Deliver call and check that they were authorized for the channel. Now a nice model might be to actually use the GRPC interceptors to do this

jyellick
2016-11-09 14:54
@yacovm In the TLS handshake, the certificates are exchanged.

garisingh
2016-11-09 14:54
it's Layer 4 versus Layer 7

garisingh
2016-11-09 14:54
TLS is layer 4

yacovm
2016-11-09 14:54
what about the hostname/ip issue? in the client certificate

jyellick
2016-11-09 14:54
Pretty sure @simon's TLS code intercepted the cert, would need to look again.

garisingh
2016-11-09 14:54
I think he did it that way

garisingh
2016-11-09 14:55
no issue with hostname / ip on the client cert

yacovm
2016-11-09 14:55
oh? really?

yacovm
2016-11-09 14:55
only the server?

garisingh
2016-11-09 14:55
right

kostas
2016-11-09 14:55
@jyellick: You cannot extract this in the application layer IIRC. @yacovm Wasn't that the point you were making when we had that call?

garisingh
2016-11-09 14:55
so TLS client will verify the hostname of the server

kostas
2016-11-09 14:56
I remember exploiting the TLS Unique property to do some certificate pinning and associate one cert with another. But I might be wrong.

garisingh
2016-11-09 14:56
server will verify client cert against its trust store

yacovm
2016-11-09 14:56
Yeah Gari you're right- I totally forgot what was the purpose of that in the first place :upside_down_face:

garisingh
2016-11-09 14:56
(I spent years dealing with TLS in DataPower)

garisingh
2016-11-09 14:56
and all of its fun and exciting issues

yacovm
2016-11-09 14:58
@kostas you can if you configure the client in a different way

yacovm
2016-11-09 14:58
when I said you can't I simply used fabric 0.5 with SSL on and saw you can't

yacovm
2016-11-09 14:58
but Simon did something else entirely

yacovm
2016-11-09 14:59
I mean, I haven't seen it- but I saw a commit simon did so I guess you can

garisingh
2016-11-09 15:00
on another point that @jyellick made as well - the "out of the box" plan for mapping peer certificates to "participants" (since a participant may have multiple peers) was to use one of 2 mechanisms: 1) mapping of participant to a root / intermediate authority, 2) if all certs are issued by the same root authority, use something in the DN to distinguish participants

garisingh
2016-11-09 15:00
of course you could use individual certs as well

garisingh
2016-11-09 15:01
we are going to need a common piece of crypto code for peers and orderers which implements this logic

garisingh
2016-11-09 15:01
we REALLY need to write this all down somewhere :disappointed:

garisingh
2016-11-09 15:02
@jyellick and @kostas - I'm in your neck of the woods early next week (M-W) if we want to draw it up on the board. We'll be working through similar issue(s) for the endorsers, etc

kostas
2016-11-09 15:05
So based on what I read here, you'll need your own TransportAuthenticator to extract the TLS client certificate. https://github.com/grpc/grpc-go/issues/111

garisingh
2016-11-09 15:06
so maybe start with the signature approach :wink:

kostas
2016-11-09 15:07
At any rate, it seems that you're all considering this as a solved problem, so I am good.

kostas
2016-11-09 15:07
Right.

garisingh
2016-11-09 15:07
depends on how much code you want to write :wink:

garisingh
2016-11-09 15:07
we can make it work either way

garisingh
2016-11-09 15:07
frankly, again we should have common GRPC server code as well

jyellick
2016-11-09 15:07
Attaching a signature to `Deliver` would definitely not be much work. But, if we are going to require mutual TLS, seems like we would be making the client do redundant work.

garisingh
2016-11-09 15:09
my initial take would be to forget about mutual TLS for now and go with the signature approach since that's the way it's being done in the peer code for now as well

garisingh
2016-11-09 15:09
@yacovm does it in gossip as well based on the discussion yesterday?

yacovm
2016-11-09 15:10
I just want to understand something- the server side (orderer) would implement its TLS server transport in a way that walks the certificate chain up to the CA, and *BOTH* sides would need to check CRLs, right? @garisingh ?

yacovm
2016-11-09 15:10
In gossip I don't rely on signed certificates, it works with self-signed certificates because I have a challenge-response in the handshake when a peer connects to you (2-way handshake)

yacovm
2016-11-09 15:11
the certificates used in the challenge-response are the application layer's

garisingh
2016-11-09 15:11
I meant you use signatures anyway

garisingh
2016-11-09 15:11
no reliance on mutual TLS for identification

yacovm
2016-11-09 15:11
you mean no reliance on signed certs?

garisingh
2016-11-09 15:14
sorry - you have a "handshake" mechanism which uses signed messages to mutually identify each other. You don't rely on mutual TLS for identification at the gossip layer

yacovm
2016-11-09 15:14
I don't rely on the TLS certificates for anything

yacovm
2016-11-09 15:14
just on the application certs

vukolic
2016-11-09 16:34
after discussion with @hgabor the idea is to proceed as follows with enabling simplebft to run efficiently on WANs

vukolic
2016-11-09 16:35
https://jira.hyperledger.org/browse/FAB-897 involves invasive changes to how simplebft (sbft) works

vukolic
2016-11-09 16:36
so the goal is to have a separate implementation (called e.g., pipelinedPBFT) based on simplebft fork

vukolic
2016-11-09 16:36
sbft could still be used in clusters and pipelinedPBFT could be used in WANs

vukolic
2016-11-09 16:37
solution to https://jira.hyperledger.org/browse/FAB-897 will be implemented against the same interfaces (System/Receiver) that sbft defines

vukolic
2016-11-09 16:38
so orderer/sbft will remain practically unchanged (except for refactoring to enable use of different consensus implementations)

vukolic
2016-11-09 16:38
hence https://jira.hyperledger.org/browse/FAB-897 is out of the simple BFT epic

vukolic
2016-11-09 16:40
essentially we would have at least two bft core consensus components

vukolic
2016-11-09 16:43
@simon @jyellick @kostas @garisingh ^^^

kostas
2016-11-09 16:46
Noted, skimming through the linked paper now.

jyellick
2016-11-09 16:47
I'm a little wary of trying to make sbft too pluggable. We tried that with 'classic'/'batch'/'sieve' last time, and I think ultimately it caused a lot more headache than anticipated. Obviously it can be done, we will just have to be careful how we architect this.

vukolic
2016-11-09 16:48
it seems actually very nice - because @simon's definitions of the System/Receiver interfaces are entirely non-sbft or even non-pbft specific

vukolic
2016-11-09 16:48
and we would reuse @hgabor's work on orderer

vukolic
2016-11-09 16:48
as well as a future component for https://jira.hyperledger.org/browse/FAB-474

vukolic
2016-11-09 16:49
also, as implementing https://jira.hyperledger.org/browse/FAB-897 may take some time - we want to have sbft alive and kicking in the meantime

vukolic
2016-11-09 16:51
I would not change much/anything in interfaces defined now in https://github.com/hyperledger/fabric/blob/master/consensus/simplebft/simplebft.go

vukolic
2016-11-09 16:52
just refactor this to have a reusable API outside simplebft package

echenrunner
2016-11-10 00:25
Thanks for the information on "reconfiguration". The problem occurs when I bring down all 4 of my VPs and then bring them back up again. There are other tests where I brought one VP down and up multiple times, and eventually it gets out of sync as well. I put in a temporary fix to work around the issue. But the idea of "reconfiguration" is something we should look at.

garisingh
2016-11-10 11:47
@binhn @jyellick @kostas - so I think we need to make a call on terminology in terms of "channels" versus "chains". It seems we use the term "channels" in discussions, etc but that within the implementation we are using the term "chain" (e.g. chainID, getDefaultChain, etc). We need to pick one

yacovm
2016-11-10 11:54
good morning. Where in the code (if it is already coded, I grep-ed the whole project and doesn't look like it is) can I find how the `JoinChannel` message looks like?

garisingh
2016-11-10 11:59
it's hard to Join something which does not exist yet :wink:

garisingh
2016-11-10 11:59
aka it's not coded up yet

garisingh
2016-11-10 12:00
(at least as far as I know and last time I looked as well)

yacovm
2016-11-10 12:22
@garisingh any idea who's the person coding it? I have a requirement to ask

yacovm
2016-11-10 12:23
Maybe @muralisr ?

garisingh
2016-11-10 12:25
would likely be Murali.

yacovm
2016-11-10 12:29
ok thanks. @jyellick I uploaded a new patch set, after discussing things related to the multi channel internally this morning. Could you please take a look, and hopefully remove the `-2`? :thinking_face: https://gerrit.hyperledger.org/r/#/c/2325/12/gossip/api/channel.go

binhn
2016-11-10 12:47
@yacovm join channel is a call to the configuration system chaincode cscc with the genesis block and a list of peers; cscc also processes config changes coming in from gossip as a block (configuration block) — i am working on cscc

binhn
2016-11-10 12:49
@garisingh chain=channel+ledger+some participants

binhn
2016-11-10 12:51
so from sdk, we create a chain, and internally we create a channel, add participants (peers), who create ledger to hold data on that channel

garisingh
2016-11-10 12:55
yeah - I get that - but in the messages sent to the ordering service "channels" are addressed by "chainID" and in reality on the orderer side a channel has participants/ACLs.

yacovm
2016-11-10 13:00
@binhn ok, then I need the following abilities in order to support multiChannel: https://gerrit.hyperledger.org/r/#/c/2325/12/gossip/api/channel.go

yacovm
2016-11-10 13:01
I need JoinChannel to be a concrete object, signed by the app, and I also need it to have a timestamp, and to be able to extract the list of peer certificates from it. We're going to gossip this message around in the org, and I need the timestamp to know which JoinChannel message is newer than another

hgabor
2016-11-10 13:04
all: today I have some non-consensus related tasks to do but I will continue with making System/Receiver be parts of some non-sbft common API to enable other orderers to use them ( @vukolic @kostas @jyellick etc.)

jyellick
2016-11-10 13:54
> so I think we need to make a call on terminology in terms of "channels" versus "chains". It seems we use the term "channels" in discussions, etc but that within the implementation we are using the term "chain" (e.g. chainID, getDefaultChain, etc). We need to pick one @garisingh Personally, my vote is we say 'chains' rather than channels. I don't see any way we can ever get away from the word 'chain' in a blockchain implementation, and since channel <-> chain is one to one, I don't really see the advantage in introducing the new term.

k.sung
2016-11-10 13:57
has joined #fabric-consensus-dev

jyellick
2016-11-10 13:58
> I uploaded a new patch set, after discussing things related to the multi channel internally this morning. @yacovm Removed the -2 and commented.

binhn
2016-11-10 14:38
@yacovm i’ll look at channel.go since cscc will call it to pass info — meantime, take a look at protos/common where the genesis block comes in with 1 transaction of type CONFIGURATION. The transaction has a ChainHeader which contains a timestamp

yacovm
2016-11-10 14:38
of course, I don't need you to look at the other file.

yacovm
2016-11-10 14:41
so it's Payload-->Header-->ChainHeader-->timestamp. SignatureHeader is the signature on the header? Where are the actual bytes of the signature then?

jyellick
2016-11-10 14:41
The actual signature bytes are in the `signature` field of the `Envelope`

jyellick
2016-11-10 14:42
They are over the `payload`, which includes a `Header` which embeds a `ChainHeader` (which contains the chainID et al) and a `SignatureHeader` (which contains the identity, nonce, et al)

jyellick
2016-11-10 14:43
The reason the `Header` was split into the `ChainHeader` and the `SignatureHeader` is because there are cases where a message wants to share a `ChainHeader` among several signatures, but needs a unique `SignatureHeader` per signature.

yacovm
2016-11-10 14:49
so the configuration transaction is essentially going to be made into a single block (of 1 transaction), and this block is going to be passed to the peers, which would: 1) verify the signature over the envelope, 2) turn the payload bytes into a Payload, 3) turn the data of the Payload into a Block and from there read the configuration?

tuand
2016-11-10 14:51
that's how I'm building the block for bootstrap ... haven't done anything for signature verification yet

yacovm
2016-11-10 14:51
I don't need the code to be written, all I need is for the right people to endorse my commit and promise that the capabilities I need for gossip multi channel support will be there

jyellick
2016-11-10 14:55
> 3) turn the data of the Payload to a Block and from there read the configuration? No. The `Payload` contains `data` of type `ConfigurationEnvelope`, not an embedded block.

jyellick
2016-11-10 14:56
Configuration blocks are blocks which only ever contain a single transaction, whose chain header type is `CONFIGURATION_TRANSACTION`

kostas
2016-11-10 14:56
I would also refer you to the static bootstrapper changeset, as it provides a concrete example of what this looks like.

jyellick
2016-11-10 14:57
The semantics changed slightly after some feedback from @binhn, so I would encourage you to look at the static bootstrapper in this pending CR https://gerrit.hyperledger.org/r/#/c/2371/

tuand
2016-11-10 14:57
scrum ...

2016-11-10 14:58
@tuand has started a Google+ Hangout for this channel. https://hangouts.google.com/hangouts/_/terjim7t4vhi5mqq4boqcqpytqe.

yacovm
2016-11-10 16:05
1) all right, I looked at the google doc @jyellick showed me, I assume at some point the configurationItem will have a list of peers or something like that? because it's not coded, right? 2) in the gerrit item you linked- I'm supposed to look at the test, right?

jyellick
2016-11-10 16:06
> at some point the configurationItem will have a list of peers or something like that The configuration will have a list of peer orgs, never peers. > in the gerrit item you linked- I'm supposed to look at the test, right? You should look at `fabric/orderer/common/bootstrap/static/static.go` to see an example of the actual configuration being encoded.

jyellick
2016-11-10 16:07
^ @yacovm

yacovm
2016-11-10 16:08
this isn't what I understood from @binhn , I was under the impression that the JoinChannelMessage given to each peer will have the list of *all peers in the channel*

yacovm
2016-11-10 16:08
This is *critical* to implement the multi channel support in the current proposed flow

tuand
2016-11-10 16:08
at some point (hopefully soon) there will be a configurationItem whose value is a list of peer CA certs

tuand
2016-11-10 16:09
ordering service uses that list to determine who can connect

yacovm
2016-11-10 16:09
@tuand , I understand that the ordering service only cares about the org

yacovm
2016-11-10 16:09
but we were told that the peers themselves know each peer in the channel

jyellick
2016-11-10 16:09
The `JoinChannelMessage` may have the list of all peers, I am only speaking to the contents of the configuration transaction

tuand
2016-11-10 16:09
correct

jyellick
2016-11-10 16:10
Who is allowed to transact on the chain is a property of chain config. Who is actively transacting and interested in the chain is not a property of the chain, this is a property of the peer network. It does not belong in the chain config.

tuand
2016-11-10 16:10
configurationItem , not configurationID

yacovm
2016-11-10 16:10
But binh said: ``` join channel is a call to the configuration system chaincode cscc with the genesis block and a list of peers; ```

jyellick
2016-11-10 16:10
Exactly

jyellick
2016-11-10 16:11
With `the genesis block` what you see there, _and_ `a list of peers`

jyellick
2016-11-10 16:11
The list of peers does not come from the genesis block.

yacovm
2016-11-10 16:11
but a system chaincode produces a transaction... I'm confused now. Isn't what's being produced that configurationItem?

jyellick
2016-11-10 16:13
To my knowledge, the application creates the configuration transaction for a new channel, submits it to ordering, and gets back the genesis block. It then decides which peers are going to actually use the channel, and sends them each a `JoinChannel` with that genesis block, and the list of peers it has concluded should be on the channel.

yacovm
2016-11-10 16:13
so the genesis block is the configuration?

yacovm
2016-11-10 16:13
oh

jyellick
2016-11-10 16:13
The genesis block is the first block in the chain, it embeds a single transaction of type `CONFIGURATION_TRANSACTION` and carries as its data a `ConfigurationEnvelope` which holds an arbitrary number of `SignedConfigurationItem`s (each of which embeds a `ConfigurationItem` which contains whatever configuration information, like peer orgs and policies)

yacovm
2016-11-10 16:13
I see now, many thanks. This isn't good for me :disappointed:

yacovm
2016-11-10 16:14
I need the peer list to be signed

yacovm
2016-11-10 16:14
I mean, I want it to be signed

jyellick
2016-11-10 16:14
Why can the app not sign it?

jyellick
2016-11-10 16:14
How did you know to trust the JoinChannel?

yacovm
2016-11-10 16:14
I was under the impression that the app will sign it,

yacovm
2016-11-10 16:14
I just need *someone* to do that, so I could leverage that signature

jyellick
2016-11-10 16:15
From what I've just heard (and I have not been involved in these discussions), my impression was that the app signs over the JoinChannel, which includes the peer list and the genesis block. Whether it signs each, or both, or some other combination, I don't know.

jyellick
2016-11-10 16:16
I only wish to stress that it is _not_ the configuration transaction which contains the peer list

jyellick
2016-11-10 16:16
(And therefore it is _not_ the genesis block which contains this information)

c0rwin
2016-11-10 16:17
Application signs the list of peers, while GB (genesis block) contains info about org rather than peers

c0rwin
2016-11-10 16:18
@c0rwin uploaded a file: https://hyperledgerproject.slack.com/files/c0rwin/F310Z4666/pasted_image_at_2016_11_10_11_18_am.png and commented: When we draw a sequence on board it looked something like this:

c0rwin
2016-11-10 16:19
at very high level (w/o deep details)

jyellick
2016-11-10 16:20
Thanks @c0rwin that is how I envisioned it, but it's great to have a real flow diagram and confirmation

tuand
2016-11-10 16:21
attach the diagram to the jira issue ? before slack loses it for us

garisingh
2016-11-10 16:27
@c0rwin - so is there an implied step in this diagram where all the peers that belong to an organization connect to each other via the gossip layer?

c0rwin
2016-11-10 16:28
diagram depicts joinChannel call per single peer, so I guess it doesn’t include the logic where all org peers get connected together

garisingh
2016-11-10 16:29
and follow-on - does that mean that basically the gossip layer then dynamically builds up the list of peers grouped by organization?

c0rwin
2016-11-10 16:29
@tuand can you tell me the JIRA item, you’d like me to attach the diagram?

c0rwin
2016-11-10 16:30
@garisingh my answer would be - yes, while we can confirm that w/ @yacovm also

garisingh
2016-11-10 16:35
@jyellick - do you imagine that the "global" channel (used to be called the system channel) would be used to distribute the list of organizations to all peers which connect to the ordering service? (not really a function of the ordering service but rather a use of the ordering service). My assumption would be that this provides the "global" list of organizations that are part of the overall network, and then when specifying access control at the channel level you would reference these organizations. Or did you imagine that organization info gets duplicated for every chain (channel + ledger)?

jyellick
2016-11-10 16:36
@garisingh So, I think ultimately, we will need a private orderer only chain for a number of reasons. However, for simplicity's sake, I assumed every chain would have the full list of peer org certs in the config.

garisingh
2016-11-10 16:38
that implies that the list of possible peer orgs is always out of band from the overall "global" network and that orgs are added at the "chain" level

garisingh
2016-11-10 16:39
this is why I like to work backwards - meaning if I had a config file on each peer, what would need to be in it :wink:

garisingh
2016-11-10 16:39
then we figure out how to distribute it

garisingh
2016-11-10 16:42
in any case, it seems that within the fabric we need an entity structure known as an organization. We of course need that in the orderer as well, and I think it is more than just the org certificate

jyellick
2016-11-10 16:43
@garisingh I've always thought the notion of treating the 'system chain' specially was odd. It's just a chain, and it has ACLs, which sounds like a channel to me.

garisingh
2016-11-10 16:43
but I think that belongs mostly over in peer dev world

garisingh
2016-11-10 16:43
agreed

garisingh
2016-11-10 16:44
I guess we talk about a lot of conceptual entities which don't really seem to exist across the design :wink:

jyellick
2016-11-10 16:45
I would actually propose, that the 'orderer chain' is not special either. Instead, when creating a channel, I would specify the 'source chain' or some such thing, which is where the actual channel creation authorization policy is stored. So, you could start up an ordering service, and consortium A comes over, and you fire up a 'Consortium A ordering chain' on top of the initial chain, and consortium B joins, and you can do the same. That way you get instant multi-tenancy for free.

jyellick
2016-11-10 16:46
We definitely did invent the term 'Peer Org' while trying to write the bootstrap bdd I think... seems like it probably should have originated in the design, but some of these things are hard to see until you flesh them out

garisingh
2016-11-10 16:54
I am not faulting anyone here

garisingh
2016-11-10 16:55
and I believe designs evolve as you build. Just saying we might need to catch our breath again and see what new things have emerged

binhn
2016-11-10 17:14
@garisingh @jyellick app may create a “system” chain to orchestrate transactions, but that should be use-case specific, so i agree that we don’t treat a channel any more specific than others

tuand
2016-11-10 17:27
@c0rwin i'd say add it to the issue @yacovm is using for his api work ( don't know the number offhand )

shinsa
2016-11-11 06:48
has joined #fabric-consensus-dev

kostas
2016-11-11 16:34
Should be done now, thanks for the heads up.

oiakovlev
2016-11-11 18:43
Hi, QQ: does any transaction-signing mechanism exist in HL? So, for example, to check whether a transaction was signed by user A? I realize that we could implement such signatures using certificate attributes, for example, and store the result in the KVS. But is there something out of the box?

oiakovlev
2016-11-11 18:43
I'm asking with v0.6 in view, and am also curious about plans for v1

yacovm
2016-11-11 21:55
Question- there are configuration blocks that change the membership of the ordering service, right? On which channel/chain are they received? In a system-level chain, or in all chains? If it's the former- then why are they also sent on all chains? If the latter- then how is the peer expected to behave when it reads a configuration that tells it to connect to a new ordering instance in 1 chain but in other chains that block hasn't been received yet?

kostas
2016-11-12 16:21
Last time I checked, this notion of a system, common chain that is exposed to all peers is going away. You need the membership changes posted on every chain so that you ensure they are ordered on every chain.

kostas
2016-11-12 16:23
If there used to be orderers 1,2,3 around, and the receiver gets a block in chain A that says the orderers are now 1, 2, 3, 4 and in chain B no such message has arrived, then it shouldn't accept a message for chain B from orderer 4 (yet).

jyellick
2016-11-12 16:23
+1 to what Kostas says there. Treat each chain as if it had no dependencies. For v1 I think it is reasonable to require that all a peer's chains are from the same ordering service (so in the example given, it is likely fine to source all chains from the orderers 1,2,3 until 4 appears in all chains) , but I would suggest to keep the possibility in mind for the future.
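The per-chain acceptance rule in the example above can be sketched like this (hypothetical code, not from the peer implementation): the peer tracks orderer membership independently per chain and only accepts blocks from orderers already admitted on that chain.

```go
package main

import "fmt"

// peerView tracks, per chain, which orderers the peer currently trusts,
// as updated by configuration blocks received on that chain.
type peerView struct {
	orderersByChain map[string]map[string]bool
}

// accepts reports whether a block for the given chain may be taken from
// the given orderer. Membership on one chain says nothing about another.
func (p *peerView) accepts(chain, orderer string) bool {
	return p.orderersByChain[chain][orderer]
}

func main() {
	p := &peerView{orderersByChain: map[string]map[string]bool{
		// Chain A has seen the config block adding orderer 4; chain B has not.
		"A": {"1": true, "2": true, "3": true, "4": true},
		"B": {"1": true, "2": true, "3": true},
	}}
	fmt.Println(p.accepts("A", "4"), p.accepts("B", "4"))
}
```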

yacovm
2016-11-12 16:36
that's exactly what I thought of- this functionality would allow a peer to get serviced by completely different ordering services

jyellick
2016-11-12 16:39
In the future, especially with sbft, I can imagine that different chains might be serviced by different sets of orderers. So I would try not to make any decisions which entirely preclude it

subzer0
2016-11-12 20:17
has joined #fabric-consensus-dev

kostas
2016-11-13 16:17
I'm trying to figure out what the new-chain configuration _exactly_ looks like. (I'm writing a test client that sends such transactions.) Chime in here, if you have thoughts: https://jira.hyperledger.org/browse/FAB-998?focusedCommentId=19750&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-19750

wale
2016-11-13 19:26
has joined #fabric-consensus-dev

xixuejia
2016-11-14 00:29
has joined #fabric-consensus-dev

simon
2016-11-14 07:52
hi

vukolic
2016-11-14 13:09
@yacovm could gossip help solve https://jira.hyperledger.org/browse/FAB-1096 - in the way described under item 3

vukolic
2016-11-14 13:09
?

yacovm
2016-11-14 13:12
Hi Marko. Are you asking whether we fill up holes of raw block dissemination in peers?

hgabor
2016-11-14 13:15
@vukolic we agreed on the following:
> by the way - what we should have also is some refactoring so the System and Receiver APIs of SBFT are put as such so we can have other consensus implementations
@simon we should discuss this here

simon
2016-11-14 13:15
yes

vukolic
2016-11-14 13:17
@yacovm I am asking whether you could

vukolic
2016-11-14 13:18
it seems to me that you could

vukolic
2016-11-14 13:18
so what "we" would be doing seems redundant

yacovm
2016-11-14 13:18
what do you mean whether we could? we already do :slightly_smiling_face:

vukolic
2016-11-14 13:18
that's my point

yacovm
2016-11-14 13:18
we presented this at a demo in front of everyone at RTP last week

vukolic
2016-11-14 13:18
so you think (agree) that gossip solves https://jira.hyperledger.org/browse/FAB-1096

vukolic
2016-11-14 13:18
?

vukolic
2016-11-14 13:18
(sorry haven't seen that demo)

yacovm
2016-11-14 13:19
(np) Wait, I don't understand- are you implying that the PBFT memory buffers depend on what its clients know?

vukolic
2016-11-14 13:19
ok - let me repeat

yacovm
2016-11-14 13:20
I'm on a call now, so maybe the people talking (in accented english) are interfering with my reading

vukolic
2016-11-14 13:20
now not all sbft orderers have the entire raw ledger

vukolic
2016-11-14 13:20
this is an implementation simplification

vukolic
2016-11-14 13:20
a big one

yacovm
2016-11-14 13:21
ok I dropped from the call just for you

vukolic
2016-11-14 13:21
now this is not a huge issue as orderers/consenters need not have the entire raw ledger

yacovm
2016-11-14 13:21
finally can think

yacovm
2016-11-14 13:21
wait wait

yacovm
2016-11-14 13:21
I was told that all PBFT orderers in the future would have the raw ledger

vukolic
2016-11-14 13:22
they would

yacovm
2016-11-14 13:22
what the gossip gives you is full state synchronization of the ledger among all peers

vukolic
2016-11-14 13:22
but they would have holes

yacovm
2016-11-14 13:23
wait

yacovm
2016-11-14 13:23
why would they have holes?

yacovm
2016-11-14 13:23
how is it even possible?

vukolic
2016-11-14 13:23
because it makes our life MUCH easier

yacovm
2016-11-14 13:23
oh, wait actually it's possible

yacovm
2016-11-14 13:23
if a PBFT node was dead

vukolic
2016-11-14 13:23
yes :slightly_smiling_face:

yacovm
2016-11-14 13:23
and is now alive

vukolic
2016-11-14 13:23
for example

vukolic
2016-11-14 13:23
now

yacovm
2016-11-14 13:23
then you don't replicate the state among them, you just need it for ordering

vukolic
2016-11-14 13:23
peers need holes filled

yacovm
2016-11-14 13:23
we do that

vukolic
2016-11-14 13:23
exactly

yacovm
2016-11-14 13:24
we have 2 layers of state transfer actually

vukolic
2016-11-14 13:24
and https://jira.hyperledger.org/browse/FAB-1096 describes 3 ways for peers to have hole-less raw ledger

yacovm
2016-11-14 13:24
the first layer gives you dissemination from peers that connect to ordering service and then forwards blocks to peers

vukolic
2016-11-14 13:24
I am asking can we assume gossip solves this - per item 3?

yacovm
2016-11-14 13:24
to make things scale

yacovm
2016-11-14 13:24
the 2nd layer of state transfer gives you an ability to fill up blocks that the ordering service isn't sending anymore

yacovm
2016-11-14 13:25
for example- all nodes of PBFT service now reached block 100 and you need block 80, then you (the peer) go to another peer that you (somehow :slightly_smiling_face: ) know it has block 80 and ask it for the block

vukolic
2016-11-14 13:26
for instance - if your implementation is pull based

yacovm
2016-11-14 13:26
the state transfer is indeed pull based of course because only you care about the blocks you don't have, no one cares about you in a byzantine environment

vukolic
2016-11-14 13:26
right ok

vukolic
2016-11-14 13:26
so my point

vukolic
2016-11-14 13:27
us filling in raw ledger holes @orderers/consenters

vukolic
2016-11-14 13:27
would be redoing a subset of your job

vukolic
2016-11-14 13:27
which obviously I want to avoid

yacovm
2016-11-14 13:27
there is a small problem here

yacovm
2016-11-14 13:27
you need to make sure that enough peers got a block you sent

vukolic
2016-11-14 13:28
we do that

yacovm
2016-11-14 13:28
oh cool

yacovm
2016-11-14 13:28
then we're OK

vukolic
2016-11-14 13:28
replace peers with orderers/consenters

vukolic
2016-11-14 13:28
and we do that

yacovm
2016-11-14 13:28
no

yacovm
2016-11-14 13:28
wait

yacovm
2016-11-14 13:28
I can't replace it

vukolic
2016-11-14 13:28
that depends

yacovm
2016-11-14 13:28
what if you sent block 100 and no one asked it?

yacovm
2016-11-14 13:29
ok well you can't "send" a block

yacovm
2016-11-14 13:29
you implement the deliver

yacovm
2016-11-14 13:29
(gRPC call)

vukolic
2016-11-14 13:29
yes

vukolic
2016-11-14 13:29
so what consensus can offer to gossip

vukolic
2016-11-14 13:29
is the following guarantee

yacovm
2016-11-14 13:29
so- if for instance, as an extreme case, no peer called Deliver and you have 10000 blocks- then you need to keep them all in-memory

yacovm
2016-11-14 13:29
or on disk

simon
2016-11-14 13:29
who is you?

yacovm
2016-11-14 13:29
PBFT of course

vukolic
2016-11-14 13:30
every raw ledger batch is replicated across a subset of correct orderers/consenters

vukolic
2016-11-14 13:30
then gossip takes over

vukolic
2016-11-14 13:30
and replicates this to all peers as well as possibly to consenters/orderers that do not have this batch

yacovm
2016-11-14 13:30
nope

vukolic
2016-11-14 13:30
ok - why not?

yacovm
2016-11-14 13:30
that's not how it's done

yacovm
2016-11-14 13:31
gossip doesn't live inside the ordering service

yacovm
2016-11-14 13:31
some peers call Deliver on the PBFT instances

vukolic
2016-11-14 13:31
it does not have to

yacovm
2016-11-14 13:31
they get blocks and they send them to other peers via gossip

vukolic
2016-11-14 13:31
I still do not see the issue?

yacovm
2016-11-14 13:31
I don't think there is an issue if you save batches on disk

vukolic
2016-11-14 13:32
we do

yacovm
2016-11-14 13:32
what's the issue then?

vukolic
2016-11-14 13:32
:slightly_smiling_face:

vukolic
2016-11-14 13:32
ok let's repeat

yacovm
2016-11-14 13:32
you want maybe a hangout?

vukolic
2016-11-14 13:33
normally in textbook pbft implementations, all pbft nodes - consenters/orderers in our case - have the ENTIRE history

vukolic
2016-11-14 13:33
history = raw ledger (in HL speak)

vukolic
2016-11-14 13:33
now

yacovm
2016-11-14 13:33
and you said you won't have it, so you won't have to sync slow nodes or nodes that came alive

vukolic
2016-11-14 13:33
exactly - because we do not need that for ordering (modulo consenter reconfiguration - but let's put that aside now)

vukolic
2016-11-14 13:34
it is peers who need hole-less RL

vukolic
2016-11-14 13:34
not consenters/orderers

vukolic
2016-11-14 13:34
hence I want to not implement state transfer at all @consensus

yacovm
2016-11-14 13:34
yeah, and like I said- if a peer `p` knows of a peer `q` who has a block that `p` needs it'll get it from `q`

vukolic
2016-11-14 13:34
but to reuse gossip for that

vukolic
2016-11-14 13:36
is "peer"="gossip peer" or "peer"="HL peer"

yacovm
2016-11-14 13:36
oh, every HL peer is a gossip peer of course

vukolic
2016-11-14 13:36
yes but

vukolic
2016-11-14 13:36
I am alluding to the case

yacovm
2016-11-14 13:36
there are peers, nodes (orderers, CAs etc) and apps

vukolic
2016-11-14 13:36
where consenters are also gossip peers

yacovm
2016-11-14 13:36
no they are not

vukolic
2016-11-14 13:36
ok - why not?

yacovm
2016-11-14 13:37
ordering service nodes, have no gossip code in them

vukolic
2016-11-14 13:37
well

vukolic
2016-11-14 13:37
this is related to hooking the two up

vukolic
2016-11-14 13:37
what is the showstopper to do that?

yacovm
2016-11-14 13:37
@jyellick and @kostas basically :wink:

vukolic
2016-11-14 13:37
aha

vukolic
2016-11-14 13:37
so soft problem

vukolic
2016-11-14 13:38
solo does not need that

vukolic
2016-11-14 13:38
and probably CFT does not either

vukolic
2016-11-14 13:38
but for BFT we may want to hook up

yacovm
2016-11-14 13:38
I am indifferent to architecture arguments (unless they affect me), I'm just a simple man who writes code.... if you want to open something up for discussion that is related to the ordering service, you should ask them

vukolic
2016-11-14 13:39
well let's put that aside

simon
2016-11-14 13:39
@vukolic is your goal to fill the gaps in the orderer raw ledger?

simon
2016-11-14 13:39
or is your goal for the committers to tolerate gaps

vukolic
2016-11-14 13:39
@simon this is not needed unless we have consenter reconfiguration

yacovm
2016-11-14 13:39
I thought he wanted to fill up holes in peers

vukolic
2016-11-14 13:39
but with consenter reconfiguration this is needed as well

vukolic
2016-11-14 13:40
so I want gossip capability from consenters to peers in the first place

yacovm
2016-11-14 13:40
you want maybe to chime in the consensus scrum today and ask that?

vukolic
2016-11-14 13:40
but also, eventually, among consenters themselves

vukolic
2016-11-14 13:40
ask what?

yacovm
2016-11-14 13:40
"so I want gossip capability from consenters to peers in the first place" - what does that mean precisely?

yacovm
2016-11-14 13:40
I mean- to talk about any architectural change you may want

vukolic
2016-11-14 13:41
this is not an architectural change

yacovm
2016-11-14 13:41
so I still don't understand what change you're suggesting exactly :confused:

vukolic
2016-11-14 13:41
this is making gossip part of our bft consensus service implementation

yacovm
2016-11-14 13:41
"so I want gossip capability from consenters to peers in the first place" - what does that mean

yacovm
2016-11-14 13:41
ohhhhhh

yacovm
2016-11-14 13:42
you want to use our gossip code in PBFT code?

vukolic
2016-11-14 13:42
well

vukolic
2016-11-14 13:42
in a structured manner

vukolic
2016-11-14 13:42
but the answer is yes

yacovm
2016-11-14 13:42
hmmm but we can only replicate raw ledger blocks

yacovm
2016-11-14 13:42
is that ok?

vukolic
2016-11-14 13:42
that's what I need

yacovm
2016-11-14 13:42
I mean, the multi-signed ledger blocks

yacovm
2016-11-14 13:42
not something "uncooked"

vukolic
2016-11-14 13:42
that's what we have :slightly_smiling_face:

yacovm
2016-11-14 13:43
ok then I would be honored :wink:

vukolic
2016-11-14 13:43
my pleasure :slightly_smiling_face:

yacovm
2016-11-14 13:43
but you need I guess to talk about it with others, don't you?

vukolic
2016-11-14 13:43
I talked to @simon :slightly_smiling_face:

yacovm
2016-11-14 13:43
because from what I understood, @jyellick and @kostas wanted the ordering service to be "pure" and without gossip code

simon
2016-11-14 13:44
so would orderers fill their raw ledger?

vukolic
2016-11-14 13:44
possibly yes

vukolic
2016-11-14 13:44
@simon ^^^

simon
2016-11-14 13:44
ah

simon
2016-11-14 13:44
okay

vukolic
2016-11-14 13:44
(this is needed for reconfig)

simon
2016-11-14 13:44
so that would be a combination of (3) to achieve (2)?

yacovm
2016-11-14 13:44
I have a better idea though, marko.

yacovm
2016-11-14 13:45
I have a pull module in the gossip code

vukolic
2016-11-14 13:45
listening

yacovm
2016-11-14 13:45
it's pluggable

yacovm
2016-11-14 13:45
you can maybe use it to sync the blocks you need

yacovm
2016-11-14 13:45
without running a fully fledged gossip component

vukolic
2016-11-14 13:45
this is a lower level detail

vukolic
2016-11-14 13:45
let's talk about that

vukolic
2016-11-14 13:45
it's simpler

yacovm
2016-11-14 13:46
this isn't such a lower level detail. You already know how to send messages, right?

yacovm
2016-11-14 13:46
from orderers

yacovm
2016-11-14 13:46
to orderers

yacovm
2016-11-14 13:46
so you can use the module (maybe) to synchronize the state. you just need to "teach" it the following things:

vukolic
2016-11-14 13:46
yes we do that :slightly_smiling_face:

yacovm
2016-11-14 13:46

yacovm
2016-11-14 13:47
this is the protocol:

yacovm
2016-11-14 13:47

yacovm
2016-11-14 13:47
oh the copy-paste beheaded the right guy :open_mouth:

vukolic
2016-11-14 13:50
but I would like it more high level

vukolic
2016-11-14 13:50
I would like gossip to take care of falling back to other sources if the source I selected does not work

vukolic
2016-11-14 13:50
we have a hello message of our own - but basically the goal would be to have things happen automagically

vukolic
2016-11-14 13:51
rather than re-doing low level sends

vukolic
2016-11-14 13:55
or I have to select right away the worst-case number of remote peers?

jyellick
2016-11-14 13:58
@vukolic Requiring that ordering nodes be able to initiate network connections to any node which is not another orderer is a bit of a non-starter for the "as a service" model.

vukolic
2016-11-14 13:59
@jyellick you seem to be implicitly assuming this is a push-based gossip

vukolic
2016-11-14 13:59
it seems to me that it is a pull-based one

jyellick
2016-11-14 14:00
So what is the flow for reconfiguration? Ordering node starts up, and waits for a gossiping peer to connect to it?

yacovm
2016-11-14 14:00
Marko, the pull protocol in gossip is optimized for gossip pulling of blocks which are in-flight

yacovm
2016-11-14 14:01
it's not optimized for your use case but I think it can be very easily

yacovm
2016-11-14 14:01
yeah and I agree with what Jason said

yacovm
2016-11-14 14:01
It doesn't make any sense having an orderer node connect to a *non*-orderer node, or even pull information out of it.

vukolic
2016-11-14 14:02
(where did I write I want that?)

vukolic
2016-11-14 14:02
I want

yacovm
2016-11-14 14:02
"other sources"

vukolic
2016-11-14 14:02
1) peers pull info from orderers

vukolic
2016-11-14 14:02
2) orderers pull info from other orderers

yacovm
2016-11-14 14:03
ok, so (1) is achieved via Deliver, and (2) can be achieved if you adopt a mutation of the pull mechanism in gossip

jyellick
2016-11-14 14:04
But (2) does not need gossip? The cardinality of the orderer set is relatively small, why not simply pull the blocks over a normal stream, as we did before for 0.5/0.6?

yacovm
2016-11-14 14:04
Yeah but you see, gossip comes with a very big bunch of other stuff

yacovm
2016-11-14 14:06
I think since you already know how to send messages between orderers you could simply instantiate a `PullEngine` (a gossip object) and give it a customized `PullAdapter` which will do the work for you

jyellick
2016-11-14 14:08
I agree that possibly re-using the gossip state transfer code would save the orderer the headache of re-implementing it, though with in-order block retrieval, I don't think it would be a very complicated piece of code. Either way, if it is just state transfer among orderers, gossip/not seems like an implementation detail.

yacovm
2016-11-14 14:09
wait, why not have the orderer instance that needs a block simply call Deliver on another ordering instance?

yacovm
2016-11-14 14:09
or maybe I'm missing something here

simon
2016-11-14 14:10
it doesn't know which one to ask

simon
2016-11-14 14:11
and it doesn't maintain a list of which blocks it doesn't have

simon
2016-11-14 14:11
i think we will need a "headers only" deliver anyways

simon
2016-11-14 14:11
and in that case, the orderer could subscribe to headers only from all connected orderers

simon
2016-11-14 14:12
and that's half of a push mechanism

yacovm
2016-11-14 14:12
oh... well, that's a problem then.

yacovm
2016-11-14 14:12
why doesn't it know which one to ask?

simon
2016-11-14 14:12
and you'd put the window at the lowest gap you have

simon
2016-11-14 14:13
so if you hear about a batch you have a gap at, you'd ask for that batch

simon
2016-11-14 14:13
state transfer essentially

yacovm
2016-11-14 14:13
wait simon I don't understand

yacovm
2016-11-14 14:13
how can you not know you have a gap?

yacovm
2016-11-14 14:13
just read the sequences

simon
2016-11-14 14:13
and then you yourself in turn would have to notify all listeners that have their window at this batch, etc.

yacovm
2016-11-14 14:13
if it's sequential- you have no gaps. else- you have a gap?

simon
2016-11-14 14:13
some sort of gossip

simon
2016-11-14 14:13
you know you have a gap

simon
2016-11-14 14:14
you just don't know who can fill it

yacovm
2016-11-14 14:14
so ask all other orderers

yacovm
2016-11-14 14:14
whether they have the item

simon
2016-11-14 14:14
that's gossip, no?

simon
2016-11-14 14:14
not quite

yacovm
2016-11-14 14:14
yeah but I mean- you can use the pull engine in gossip to do that

simon
2016-11-14 14:14
all could ask the same guy

yacovm
2016-11-14 14:14
no

yacovm
2016-11-14 14:15
they ask everyone

yacovm
2016-11-14 14:15
or a subset

simon
2016-11-14 14:15
yes, but if there is just one reachable guy

yacovm
2016-11-14 14:15
then ask him... what's the problem :confused:

simon
2016-11-14 14:15
then all of them might be transferring from that one

yacovm
2016-11-14 14:15
so?

simon
2016-11-14 14:15
you overload one guy

yacovm
2016-11-14 14:15
but there is only 1 reachable like you just said

yacovm
2016-11-14 14:15
wait how is that even possible?

yacovm
2016-11-14 14:16
a 1-way network partition?

simon
2016-11-14 14:16
with periodic gossip, you distribute the load

yacovm
2016-11-14 14:16
the pull engine does just that

yacovm
2016-11-14 14:16
if you have 100 blocks missing

yacovm
2016-11-14 14:16
you do round-robin, or random

yacovm
2016-11-14 14:16
I don't remember what

yacovm
2016-11-14 14:16
I think random

yacovm
2016-11-14 14:17
and you ask 100/K blocks from each 1..K peers
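
The load-balancing idea above (ask 100/K blocks from each of 1..K peers) can be sketched roughly like this. This is illustrative Go with made-up names, not the actual pull-engine API in `gossip/gossip/algo/pull*.go`:

```go
package main

import (
	"fmt"
	"math/rand"
)

// assignMissing splits the missing block sequence numbers across the
// given peers, round-robin after a random shuffle, so that no single
// orderer is asked for every missing block.
func assignMissing(missing []uint64, peers []string) map[string][]uint64 {
	out := make(map[string][]uint64)
	if len(peers) == 0 {
		return out
	}
	shuffled := append([]uint64(nil), missing...)
	rand.Shuffle(len(shuffled), func(i, j int) {
		shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
	})
	for i, seq := range shuffled {
		p := peers[i%len(peers)]
		out[p] = append(out[p], seq)
	}
	return out
}

func main() {
	// 6 missing blocks spread over 3 orderers: 2 requests each.
	plan := assignMissing([]uint64{1, 2, 3, 4, 5, 6},
		[]string{"orderer0", "orderer1", "orderer2"})
	for p, seqs := range plan {
		fmt.Println(p, seqs)
	}
}
```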

simon
2016-11-14 14:17
okay

yacovm
2016-11-14 14:19
you can take a look if you want at `gossip/gossip/algo/pull*.go`

jyellick
2016-11-14 14:20
i'm still not really seeing the scenario - why do so few ordering nodes have blocks? An orderer should be able to very quickly detect when it misses a sequence number in the sbft model, and it will generally have the set of orderers which it can retrieve those blocks from.

jyellick
2016-11-14 14:21
> i think we will need a "headers only" deliver anyways
I do concur with this though

simon
2016-11-14 14:21
if the gap is larger than 1, it does not know who has the previous blocks

jyellick
2016-11-14 14:23
True, though it still seems like in most sane scenarios, the probability of picking someone with all the blocks is pretty high. In general, why would we expect so many holes across all the nodes?

simon
2016-11-14 14:23
it's possible

simon
2016-11-14 14:23
isn't that all that counts?

simon
2016-11-14 14:24
this is about how to deal with such a situation

jyellick
2016-11-14 14:25
Tolerating a situation and optimizing for it are two different things. I'm still not seeing concretely a likely scenario for how the network ends up in this state (obviously carefully orchestrated failures can do this, but seems very unlikely)

simon
2016-11-14 14:25
sure, but how do you deal with it?

simon
2016-11-14 14:26
what's the simplest way to deal with it

jyellick
2016-11-14 14:26
Well, naively, you randomly pick an orderer, ask for the block, and if he has it great, otherwise, ask someone else until you get a successful reply.


simon
2016-11-14 14:27
so you basically update the deliver window?

simon
2016-11-14 14:27
or do you have some sort of non-streaming rpc to ask for that block?

yacovm
2016-11-14 14:29
they don't (at least from what i know)

jyellick
2016-11-14 14:31
So, you pick the lowest block number you need, send a deliver (normal streaming RPC), and get blocks through the number you do need.

jyellick
2016-11-14 14:31
If you get back a 404 not found, you switch orderers and try again.

jyellick
2016-11-14 14:32
I agree it's inefficient when there are many holes spread across many ordering nodes, but the logic is simple, easy to implement, and doesn't require exposing any new RPCs.

jyellick
2016-11-14 14:33
By setting the window size on the Deliver SeekInfo, you can actually replicate a 'Pull just n blocks'

jyellick
2016-11-14 14:34
(So, if you are missing blocks 3-5, and 8, you can call Deliver() SeekInfo(seekto=3 window=3), SeekInfo(seekto=8, window=1) and that will retrieve exactly blocks 3-5 and 8 )
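
The gap arithmetic in that example can be sketched as follows; `seekReq` and `gapsToSeeks` are hypothetical names, not the actual SeekInfo proto types:

```go
package main

import (
	"fmt"
	"sort"
)

// seekReq mirrors the (seekto, window) pair from the Deliver SeekInfo
// example; the struct itself is illustrative, not the proto type.
type seekReq struct {
	SeekTo uint64
	Window uint64
}

// gapsToSeeks collapses a set of missing block numbers into one seek
// request per contiguous run, as in the "blocks 3-5 and 8" example.
func gapsToSeeks(missing []uint64) []seekReq {
	if len(missing) == 0 {
		return nil
	}
	sort.Slice(missing, func(i, j int) bool { return missing[i] < missing[j] })
	var reqs []seekReq
	start, prev := missing[0], missing[0]
	for _, n := range missing[1:] {
		if n != prev+1 {
			reqs = append(reqs, seekReq{SeekTo: start, Window: prev - start + 1})
			start = n
		}
		prev = n
	}
	return append(reqs, seekReq{SeekTo: start, Window: prev - start + 1})
}

func main() {
	// Missing 3-5 and 8 yields one seek with window 3 and one with window 1.
	fmt.Println(gapsToSeeks([]uint64{3, 4, 5, 8}))
}
```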

vukolic
2016-11-14 14:44
TL;DR?

vukolic
2016-11-14 14:45
(had a call in the meantime)

simon
2016-11-14 14:46
can you loop me in on the scrum hangout? i'll be travelling on train

tuand
2016-11-14 14:47
@simon , tell me what to do to loop you in ?

tuand
2016-11-14 14:58
scrum ...

2016-11-14 14:58
@tuand has started a Google+ Hangout for this channel. https://hangouts.google.com/hangouts/_/nf4tqrdvbbbozc2iparf56e65ee.

yacovm
2016-11-14 15:08
pascal...? :open_mouth: @vukolic

vukolic
2016-11-14 15:10
pluscal



vukolic
2016-11-14 15:15
(although some of us have been using it way before Amazon)

simon
2016-11-14 15:15
well about that

simon
2016-11-14 15:15
nobody invited me

vukolic
2016-11-14 15:16
to amazon? pluscal?

simon
2016-11-14 15:16
to the scrum

vukolic
2016-11-14 15:16
or just scrum? :slightly_smiling_face:

yacovm
2016-11-14 15:17
thank god I thought you said Pascal

yacovm
2016-11-14 15:18
hey @jyellick I have an important question regarding multi-channel and ordering service

yacovm
2016-11-14 15:18
let's say we have orgs A, B and C in a channel X

yacovm
2016-11-14 15:19
after a while, channel X is: B, C only

yacovm
2016-11-14 15:19
what happens to the peers that call deliver on X to the orderer in org A?

jyellick
2016-11-14 15:24
They'll get back an error in reply to their seek request

yacovm
2016-11-14 15:24
how?

yacovm
2016-11-14 15:24
they'll get a downstream message that contains an error?

yacovm
2016-11-14 15:26
(deliver response)

yacovm
2016-11-14 15:26
and how will they know that it is because they're no longer in the channel, and there isn't a different fault that returns an error? @jyellick ?

simon
2016-11-14 15:28
oh there is a problem with new-view messages and null requests in sbft

simon
2016-11-14 15:29
i thought i had fixed it for the hello message, but i did not

simon
2016-11-14 15:29
so the problem is that a null request after view change produces an empty batch

simon
2016-11-14 15:29
but for that the hash of the previous batch needs to be known

simon
2016-11-14 15:33
but if we have a gap, we don't know that hash, so we cannot validate that batch
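
The hash-chain dependency here can be illustrated with a toy batch chain; the struct and hashing scheme below are illustrative, not SBFT's actual types:

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// batch models the hash chain: each batch carries the hash of its
// predecessor, including the empty batch produced by a null request.
type batch struct {
	PrevHash []byte
	Payload  []byte
}

func hashBatch(b batch) []byte {
	h := sha256.Sum256(append(b.PrevHash, b.Payload...))
	return h[:]
}

// linksTo reports whether next correctly chains to prev. With a gap,
// prev is unknown, so this check cannot be performed at all - which is
// why the post-view-change empty batch cannot be validated.
func linksTo(prev, next batch) bool {
	return bytes.Equal(next.PrevHash, hashBatch(prev))
}

func main() {
	b1 := batch{PrevHash: nil, Payload: []byte("genesis")}
	b2 := batch{PrevHash: hashBatch(b1), Payload: nil} // empty (null-request) batch
	fmt.Println(linksTo(b1, b2))
}
```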

hgabor
2016-11-14 15:40
does anybody have suggestions how to do the network testing of sbft?

simon
2016-11-14 15:47
my guess is that the p sets in the new-view message should allow you to verify the last batch hash

simon
2016-11-14 15:49
hum hum hum

simon
2016-11-14 15:49
i suspect a reconnect could fix all of this

vukolic
2016-11-14 15:51
why are there null requests in sbft

vukolic
2016-11-14 15:51
pbft had null requests for holes - if I recall correctly

vukolic
2016-11-14 15:51
there are no holes in sbft by design

jyellick
2016-11-14 18:11
> and how will they know that it is because they're no longer in the channel, and there isn't a different fault that returns an error?
@yacovm This is something I'd welcome feedback on. We could return a 403 FORBIDDEN or a 404 NOT_FOUND. This somewhat goes to the idea of "Do you return account not found or bad password" on authentication failure. We could uniformly respond with a 403 when someone accesses a channel which doesn't exist or which they are not authorized on, or we could return a 404 when the channel doesn't exist and a 403 when they are not authorized. The question is whether leaking the knowledge that the chain exists is okay or not.

yacovm
2016-11-14 18:11
why use an HTTP status code?

jyellick
2016-11-14 18:11
That's what the fabric has standardized on for error codes

yacovm
2016-11-14 18:12
lol you are REST-ful?

yacovm
2016-11-14 18:12
just kidding

yacovm
2016-11-14 18:12
where is common.Status?

jyellick
2016-11-14 18:12
`fabric/protos/common/common.proto`

yacovm
2016-11-14 18:13
I see... so I think that forbidden is good for this case.

yacovm
2016-11-14 18:13
actually

yacovm
2016-11-14 18:14
wait I take that back

yacovm
2016-11-14 18:14
is the action a peer needs to take in case of 404 the same if it's actually 403?

yacovm
2016-11-14 18:14
if it is- then maybe use always 404?

jyellick
2016-11-14 18:16
I see merits to all the options. My gut says use 404 for 'does not exist', and 403 when the user is not authorized.
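
That policy can be sketched as a tiny decision function. The plain HTTP-style ints and the function name are illustrative only; the fabric actually encodes these in `common.Status`:

```go
package main

import "fmt"

// deliverStatus sketches the 403/404 split: NOT_FOUND when the chain
// does not exist, FORBIDDEN when it exists but the caller is not
// authorized, success otherwise.
func deliverStatus(chainExists, authorized bool) int {
	switch {
	case !chainExists:
		return 404
	case !authorized:
		return 403
	default:
		return 200
	}
}

func main() {
	fmt.Println(deliverStatus(false, false)) // unknown chain
	fmt.Println(deliverStatus(true, false))  // e.g. an evicted org
	fmt.Println(deliverStatus(true, true))   // authorized reader
}
```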

jyellick
2016-11-14 18:18
Because chain IDs are going to be generated as hashes, searching for a chain's existence exhaustively does not seem feasible, and the existence of a chain doesn't give the attacker any real knowledge about its membership or contents, so it seems better to leak the chain's existence in favor of making errors less opaque to the client.

jyellick
2016-11-14 18:18
Just my gut, if anyone has a strong argument for another policy, I'm certainly ready to be swayed

yacovm
2016-11-14 18:24
how long is the hash?

yacovm
2016-11-14 18:25
*it seems better to leak the chain's existence, in favor of making errors less opaque to the client.* - I don't agree with this statement. Isn't this an API which is being used inside the project? it's not like we expose this API to customers, so making it informative rather than opaque isn't a big concern

jyellick
2016-11-14 18:37
> how long is the hash?
A good question, I assume 256 bits, but this needs to be finalized
> Isn't this an API which is being used inside the project? it's not like we expose this API to customers, so making it informative and opaque isn't a big concern
I'm not sure what you mean? The peer is a client to the ordering service. We can either be opaque about the reason why a seek request was rejected (the client knows either the chain doesn't exist, or the client is not authorized to view the service, but does not know which) or, we can be explicit about why (replying 403 on not authorized, and 404 on doesn't exist).

kostas
2016-11-14 18:46
(I agree with the 403/404 thinking.)

yacovm
2016-11-14 19:13
But jason, the client is someone that writes either the peer or the sdk, meaning a fabric developer

yacovm
2016-11-14 19:13
Being not opaque is good for debugging

yacovm
2016-11-14 19:14
And development

yacovm
2016-11-14 19:14
It has no value in the day to day scenario

jyellick
2016-11-14 19:31
I'm not following. How is having 1 error reply for 2 different conditions easier to debug than 2 error replies, one for each of 2 conditions?

yacovm
2016-11-14 20:00
it's not easier to debug, I'm saying- after the debugging phase is over, you don't need to debug anymore.

yacovm
2016-11-14 20:00
and you get more information hiding

yacovm
2016-11-14 20:10
I meant, being not opaque of course

cca
2016-11-14 20:13
@yacovm, @vukolic: reading your discussion from earlier on using gossip module's state transfer also to support the state transfers needed by the consensus protocol. YES! this is how it should be! no duplication of a state transfer function. there should not be concerns about violating modularity, but interfaces will be important: since the consensus API here already contains hash-chained batches/blocks (unlike, say, etcd or ZK), the main prerequisites exist on both sides.

yacovm
2016-11-14 20:14
let's just be clear

yacovm
2016-11-14 20:14
I didn't mean using gossip's state transfer module

yacovm
2016-11-14 20:14
I meant using gossip's pull module

yacovm
2016-11-14 20:15
they are completely different things, and the state transfer heavily relies on gossip (I think, @c0rwin correct me if I'm wrong)

c0rwin
2016-11-14 20:17
state transfer relies on gossip to get most up to date info about ledger height on other nodes

c0rwin
2016-11-14 20:18
and getting blocks of course

yacovm
2016-11-14 20:19
by the way @cca, if you want to do that, you'll need to slightly mutate the pull mechanism, because the way it is now, it doesn't fit the PBFT use case 100%....

cca
2016-11-14 20:20
OK, i am not aware of the details, sorry. but in principle, from knowing the "tip" (= most up-to-date block and its hash) one can reconstruct the correct, agreed-on sequence of raw-ledger batches. i understood that a consenter may need that and that the peers need that. from very far, this looks the same.

cca
2016-11-14 20:21
but again i leave this to you, i am not 'in' the code

abhishekseth
2016-11-15 06:02
Hey all, I am running a setup which has two peers running on two different physical machines. When security was not enabled, i was able to have communication between them. But now that I have security enabled, the peers are not able to communicate and I get a certificate error. I am using the following docker-compose.yaml file:
```yaml
# membersrvc:
#   image: hyperledger/fabric-membersrvc
#   ports:
#     - "50051:50051"
#     - "7054:7054"
#   command: membersrvc
vp1:
  image: hyperledger/fabric-peer
  ports:
    - "5000:5000"
    - "7051:7051"
    - "7050:7050"
    - "30303:30303"
    - "30304:30304"
  environment:
    - CORE_PEER_ADDRESSAUTODETECT=false
    - CORE_VM_ENDPOINT=unix:///var/run/docker.sock
    - CORE_LOGGING_LEVEL=DEBUG
    - CORE_PEER_ID=vp1
    - CORE_SECURITY_ENABLED=true
    #- CORE_SECURITY_PRIVACY=true
    - CORE_SECURITY_ENROLLID=test_vp0
    - CORE_SECURITY_ENROLLSECRET=MwYpmSRjupbT
    - CORE_PEER_DISCOVERY_ROOTNODE=9.109.251.105:7051
    - CORE_PEER_PKI_ECA_PADDR=9.109.251.105:7054
    - CORE_PEER_PKI_TCA_PADDR=9.109.251.105:7054
    - CORE_PEER_PKI_TLSCA_PADDR=9.109.251.105:7054
    - CORE_CHAINCODE_DEPLOYTIMEOUT=180000
    # - MEMBERSRVC_CA_ACA_ENABLED=true
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
  # links:
  #   - membersrvc
  command: sh -c "sleep 5; peer node start"
```
Any help is appreciated.

simon
2016-11-15 07:54
@vukolic there are null requests so that you can replay up to the latest view number from the chain

scottz
2016-11-15 15:16
@yacovm @jyellick If you are looking for another opinion: more specific info using 403 and 404 seems better than using one error code for two use cases. Yacov, the debugging phase never ends; we must expect that additions and changes to SDKs (and new SDKs) will always be made. Is there a reason why we should hide that information from ongoing development and testing?

yacovm
2016-11-15 15:18
it's a trade-off; if you hide information you reduce the attack surface. let's say that you're an org that was evicted from a channel, and you want to know if the channel still exists or not - 403 + 404 will tell you.

scottz
2016-11-15 15:36
it does not matter. a booted client already has the hash for the chaincode, and will know the problem reason. Nevertheless, if this is an internal API, then is there any risk at all?

yacovm
2016-11-15 15:38
this is not an internal API network-wise

scottz
2016-11-15 15:41
ok, then you can discard my 2nd sentence. first sentence still stands. and since that will be a problem regardless, maybe standard practice will be to destroy the entire channel when a member leaves/is expelled, and create a new channel with the remaining members after a short time.

jyellick
2016-11-15 15:42
I'd argue that once a channel is created, it can never be truly destroyed. You could cut off access to it, but you should never re-use a chainID, so I would think deleting a chain blacklists it as a chainID in perpetuity and so everyone would simply get back a 403
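The 403-vs-404 trade-off above can be made concrete with a tiny sketch. Everything here (the `registry` type, `statusFor`, the field names) is invented for illustration, not fabric's actual API; it shows the "collapse everything into 403" option, where a caller cannot probe whether a channel exists, was deleted, or merely denies them access:

```go
package main

import "fmt"

// registry is a hypothetical channel registry for this sketch.
type registry struct {
	channels    map[string]bool // channels that currently exist
	blacklisted map[string]bool // deleted chainIDs, never reusable
}

// statusFor collapses "does not exist", "was deleted", and "access denied"
// into a single 403, so the response never reveals channel existence.
func (r *registry) statusFor(chainID string, callerAuthorized bool) int {
	if !r.channels[chainID] || r.blacklisted[chainID] || !callerAuthorized {
		return 403
	}
	return 200
}

func main() {
	r := &registry{
		channels:    map[string]bool{"trade-channel": true},
		blacklisted: map[string]bool{"old-channel": true},
	}
	fmt.Println(r.statusFor("trade-channel", true))   // 200: authorized member
	fmt.Println(r.statusFor("old-channel", true))     // 403: deleted chain
	fmt.Println(r.statusFor("no-such-channel", true)) // 403: never existed
}
```

The alternative position in the thread (distinct 403 and 404) would simply split the first condition out into its own 404 branch, trading a little information leakage for easier debugging.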

yacovm
2016-11-15 15:42
destroying a channel once a member leaves is not a smart thing to do

yacovm
2016-11-15 15:42
you lose the whole chain

scottz
2016-11-15 15:49
my first sentence still holds true. an expelled member will already have the hash, right? so if they are malicious, they will know the correct hash whether they get a 404 or a 403. The benefits of using both 404 and 403 hold true for the rest of the world. Rejecting their userid/password must be enforced; let COP enforce that.

yacovm
2016-11-15 15:50
actually what's rejected is their organization, just to be precise.

yacovm
2016-11-15 15:50
when you boot a peer you boot the entire org

lhaskins
2016-11-15 15:53
@abhishekseth: when running with security, you should uncomment the membersrvc stanza and references

dongmingh
2016-11-15 16:00
has joined #fabric-consensus-dev

scottz
2016-11-15 16:06
thanks for the clarification. I think you are saying that each channel has a list of orgs/members , and we may be providing a way for the rest of the orgs to boot a member, and blacklist them so the channel can reject additional requests by that member to join again. (For my own learning: is this logic in the sdk, or fabric itself?) Wouldn't they just get a new name/address and attempt to join as a new org (rendering all that code useless)? What malicious-usecases could we really prevent by not providing 403/404 differentiation?

yacovm
2016-11-15 16:10
a channel is defined by a list of its peers and the orgs of the peers

yacovm
2016-11-15 16:14
but the ACL is enforced in 2 different places:
- in the ordering service it's enforced per org, meaning any peer from an org in the channel can call `Deliver` on the channel
- in the gossip layer I enforce a mix of 2 rules:
  - a block is never sent to a peer of an org not in the channel
  - a block is never sent to a peer not in the channel, unless it is the peer that originally pulled it from the ordering service, and there is only 1 such peer per org
That is in theory of course, because I'm implementing this as we speak
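The two gossip-layer rules described above can be sketched as a single predicate. All names here (`Peer`, `mayReceiveBlock`, the channel-org map) are assumptions for illustration, not the actual fabric gossip API:

```go
package main

import "fmt"

// Peer is a hypothetical view of a peer, as seen by the gossip layer.
type Peer struct {
	Org               string
	InChannel         bool
	PulledFromOrderer bool // the single per-org peer that pulls from ordering
}

// mayReceiveBlock applies the two rules from the discussion above.
func mayReceiveBlock(p Peer, channelOrgs map[string]bool) bool {
	// Rule 1: never send a block to a peer whose org is not in the channel.
	if !channelOrgs[p.Org] {
		return false
	}
	// Rule 2: a peer not in the channel only gets the block if it is the
	// one peer of its org that pulled it from the ordering service.
	if !p.InChannel {
		return p.PulledFromOrderer
	}
	return true
}

func main() {
	orgs := map[string]bool{"org1": true}
	fmt.Println(mayReceiveBlock(Peer{Org: "org2"}, orgs))                          // false: org not in channel
	fmt.Println(mayReceiveBlock(Peer{Org: "org1", InChannel: true}, orgs))         // true: member peer
	fmt.Println(mayReceiveBlock(Peer{Org: "org1", PulledFromOrderer: true}, orgs)) // true: the pulling peer
}
```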

yacovm
2016-11-15 21:52
by the way, there is a posting on #fabric-peer-dev by @weeds , you (who are on this channel and didn't see) may want to read it, and I also posted some comments there, @jyellick and @kostas your input is most welcome

abhishekseth
2016-11-16 07:30
@lhaskins , I know that i should uncomment the membersrvc stanza and all. But since I am running a two-physical-machine setup, the member services run only on the other machine. I am referring to that membersrvc using the ip and port of that machine:
```
- CORE_PEER_PKI_ECA_PADDR=9.109.251.105:7054
- CORE_PEER_PKI_TCA_PADDR=9.109.251.105:7054
- CORE_PEER_PKI_TLSCA_PADDR=9.109.251.105:7054
```
Do I need to do something more to get the communication between these two systems working?

simon
2016-11-16 08:54
@scottz: are you guys running tests for the kafka and sbft orderers?

lhaskins
2016-11-16 13:27
@abhishekseth : I’d like to debug this with you and see what errors you are seeing. I’ll DM you directly. @simon: Yes, I am adding more tests for the kafka orderer now; sbft is in our backlog for more tests as well.

simon
2016-11-16 13:31
great


hgabor
2016-11-16 13:39
new patchset now

hgabor
2016-11-17 11:37
does anybody know ./orderer/common/bootstrap/static/ ?

garisingh
2016-11-17 11:55
if something is breaking, I believe that Kostas fixed this in https://gerrit.hyperledger.org/r/#/c/2179/

garisingh
2016-11-17 11:55
so you might want to +2, merge and rebase your change

hgabor
2016-11-17 12:12
no, it is OK (I merged it)

hgabor
2016-11-17 12:14
there is a GenesisBlock function which creates a genesis block using the current chain (chain ID) - we would like to have the same genesis block on all replicas of the network, how could we achieve this?

hgabor
2016-11-17 12:15
@simon thought of having some kind of configuration in the network config file (a JSON)

hgabor
2016-11-17 12:16
but as I see it is not possible to inject the chain ID from outside

garisingh
2016-11-17 12:21
the working concept at this point is that when you call JoinChannel on a peer, you would provide the "genesis block" or latest "config" block (assuming you added a new organization and/or changed membership for a channel) to that peer. Is that what you mean?

hgabor
2016-11-17 12:23
what is 'JoinChannel'?

garisingh
2016-11-17 12:30
An API which I don't think exists anywhere in the codebase yet but which will be invoked on peers in order for them to join the specified channel

hgabor
2016-11-17 12:31
I was talking about SBFT. in an sbft orderer peer, where should this API be called?

garisingh
2016-11-17 12:32
create channel would be called on the SBFT orderer

garisingh
2016-11-17 12:33
the output of that would be the "genesis" and/or latest "config" block which would then be provided to peers which are to join that channel.

hgabor
2016-11-17 12:33
currently, in sbft we directly create a fileledger which needs a genesis block as an argument. I guess this direct ledger creation will change with this

hgabor
2016-11-17 12:40
until then, we need to solve this genesis block issue somehow

garisingh
2016-11-17 12:44
probably have to wait for @jyellick or @kostas - because I believe that we do have the static block creation in the current code (as channels are not yet supported)

hgabor
2016-11-17 12:45
yeah the static block creation is static.New().GenesisBlock() I guess. but it uses a random number, and that results in a different genesis block for every peer. somehow I need to work around this

hgabor
2016-11-17 13:15
maybe @vukolic knows something about this

vukolic
2016-11-17 13:16
@hgabor - no clue how we create genesis blocks :slightly_smiling_face:

vukolic
2016-11-17 13:16
this is sth indeed @jyellick may answer

hgabor
2016-11-17 13:16
I am very very sad :disappointed:

vukolic
2016-11-17 13:16
dont' be

vukolic
2016-11-17 13:16
we work w/o genesis

vukolic
2016-11-17 13:16
:slightly_smiling_face:

vukolic
2016-11-17 13:16
we just came to be

vukolic
2016-11-17 13:16
no genesis

hgabor
2016-11-17 13:17
:slightly_smiling_face:

simon
2016-11-17 13:27
@garisingh: sbft doesn't do channels

garisingh
2016-11-17 13:27
yeah - I figured that part out - neither do any of the other orderers at this point

simon
2016-11-17 13:27
:slightly_smiling_face:

yacovm
2016-11-17 13:28
Simon or @vukolic is there any progress/updates/decision about that block gap/hole sync that you asked me that time (using the pull module)

vukolic
2016-11-17 13:30
not yet but I am more and more sure that we need to talk :slightly_smiling_face:

yacovm
2016-11-17 13:31
sure, ping me when you need on slack

vukolic
2016-11-17 13:32
i've just put a 2x label (consensus/gossip) on https://jira.hyperledger.org/browse/FAB-1096

tuand
2016-11-17 13:50
The genesis block issue is discussed via FAB-359, FAB-665 and FAB-666

tuand
2016-11-17 13:51
The thought is that we create a genesis block that contains all the configuration needed to bootstrap an orderer


jyellick
2016-11-17 13:59
@hgabor Basically, the static bootstrapper was put in place as a stopgap to provide a genesis block when one has not already been generated, @tuand is working on a tool to actually create the 'real' genesis block that can be shared among orderers

jyellick
2016-11-17 14:00
@tuand We should also probably talk about how the genesis block stuff has changed a little bit this week. As @garisingh alluded to, since the concept of 'system chain' is gone, your tool will be more constrained to bootstrapping an orderer, not a peer. The big pieces will be the orderer configuration, and the channel creation policies.

tuand
2016-11-17 14:03
agreed. I'm reading @elli 's msp doc today. I also understand that @bmos299 and @scottz are starting to look at the peer joinchannel flow

simon
2016-11-17 14:06
what's the short term fix to get @hgabor unstuck?

simon
2016-11-17 14:07
right now sbft has a mode to ingest a json setup file and create state for a replica

simon
2016-11-17 14:07
this now needs to be extended to contain a genesis block

jyellick
2016-11-17 14:12
@simon @hgabor An easy way to get unstuck locally would be simply to comment out the random chainID generation and set that chainID statically. For a longer term solution the genesis block creation tool should take that sbft config and put it in the genesis block so that we don't have two different bootstrapping methods.
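The local workaround suggested above could look something like this. The `staticChainID` name is invented for this sketch; the real change would be in `orderer/common/bootstrap/static/static.go`, replacing the random chainID generation so every replica derives an identical genesis block:

```go
package main

import "fmt"

// staticChainID is a fixed 16-byte chain ID used in place of the random
// one, so that all replicas build the same genesis block. The value and
// name are illustrative; any agreed-upon constant works.
var staticChainID = []byte("0123456789abcdef")

func main() {
	// instead of: chainID, err := primitives.GetRandomBytes(16)
	chainID := staticChainID
	fmt.Println(len(chainID)) // 16
}
```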

simon
2016-11-17 14:12
i agree about the long term plan

hgabor
2016-11-17 14:13
lets see if locally commenting that solves anything btw

simon
2016-11-17 14:13
i guess replacing the random is an option

simon
2016-11-17 14:13
but it is ugly

uramoto
2016-11-17 14:51
has joined #fabric-consensus-dev

tuand
2016-11-17 14:59
scrum ...

2016-11-17 14:59
@tuand has started a Google+ Hangout for this channel. https://hangouts.google.com/hangouts/_/dcopz4r74jdz7gk52qnkolf5pee.

sanchezl
2016-11-17 15:06
I lost my connection and now the call is full.
• I have 2 patch sets waiting for reviews:
  • https://gerrit.hyperledger.org/r/#/c/2043/
  • https://gerrit.hyperledger.org/r/#/c/2459/
• I was working on FAB-890 yesterday, and will continue to work on it today, specifically making sure that the orderer is resilient to the loss of a kafka broker.

vukolic
2016-11-17 15:10
went with @simon through simplebft view change code - among other things we identified a way to eliminate null blocks from simplebft raw ledger

vukolic
2016-11-17 15:10
Simon's CR that implements this should appear soon

vukolic
2016-11-17 15:10
along with some simplifications of view change code

vukolic
2016-11-17 15:10
which should complete sbft epic

kostas
2016-11-17 15:17
> I was working on FAB-890 yesterday, and will continue to work on it today, specifically making sure that the orderer is resilient to the loss of a kafka broker.

kostas
2016-11-17 15:18
@sanchezl: Are there any blockers in particular here? Something with the sarama API not making sense, or another issue?

sanchezl
2016-11-17 15:20
No blockers.

simon
2016-11-17 15:28
should i accept a new view message even tho i didn't go into view change yet?

simon
2016-11-17 15:41
yep, null requests are gone

simon
2016-11-17 15:42
will push the CR on the train

jyellick
2016-11-17 16:52
> should i accept a new view message even tho i didn't go into view change yet?

New view should contain 2f+1 signed view change messages, so if appropriately formed, seems safe to me?
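The 2f+1 quorum mentioned above is just arithmetic on the replica count. A back-of-the-envelope helper (not fabric code; the function name is invented) for n = 3f+1 replicas:

```go
package main

import "fmt"

// viewChangeQuorum returns the number of signed view-change messages a
// new-view message must carry: 2f+1, where f is the maximum number of
// Byzantine replicas tolerated by n replicas.
func viewChangeQuorum(n int) int {
	f := (n - 1) / 3 // f Byzantine replicas tolerated out of n = 3f+1
	return 2*f + 1
}

func main() {
	fmt.Println(viewChangeQuorum(4))  // 3
	fmt.Println(viewChangeQuorum(7))  // 5
	fmt.Println(viewChangeQuorum(10)) // 7
}
```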

binhn
2016-11-17 18:38
we need to specify who can create/modify a chain on the orderers — i know we have bootstrap config, but how can we do this on-going?

garisingh
2016-11-17 19:14
@binhn - I believe that after any given "channel / chain" is initially created, part of that creation involves policy around who will be able to modify any of the various key/value pairs associated with the config

kostas
2016-11-17 19:14
This shouldn't be different than how we specify all the different actions and permissions. Fundamentally a key/value in the ConfigurationTransaction that identifies the root CA certs that are able to issue a new-chain transaction, along with an accompanying modification policy for this key/value pair (how many sigs do you need to modify the list of CA certs that can create new chains).

kostas
2016-11-17 19:14
Basically what Gari said.

binhn
2016-11-17 21:01
@garisingh config tx is bound to a chain, so changes to a config block only affect that chain — what i referred to was configuration outside of chains that governs who can create chains

binhn
2016-11-17 21:01
and how can that be modified

garisingh
2016-11-17 22:30
@binhn - it's basically a chicken and egg problem for the initial setup. I suppose the only thing you can do is start the ordering service with one or more admins already there

nikileshsa
2016-11-17 22:57
@nikileshsa uploaded a file: https://hyperledgerproject.slack.com/files/nikileshsa/F348GEBGC/panic__error_unmarshaling_into_structure__1_error_s__decoding______general__has_invalid_keys__profile.sh and commented: I am trying to start a solo orderer using the latest fabric code from master and its throwing a panic as shown in this code snippet. Has anyone seen this before? (it was working as of yesterday).

garisingh
2016-11-17 23:03
I'd suggest pulling the latest and trying again. I just pulled the latest down myself, ran `go build && ./orderer` in the `fabric/orderer` directory and it started up nicely for me

nikileshsa
2016-11-17 23:06
@garisingh ..thanks for trying this out..will let you know as to how it goes..

nikileshsa
2016-11-17 23:12
@garisingh resolved... looks like a build issue in my local... thanks again..

jdockter
2016-11-18 15:18
has joined #fabric-consensus-dev

nvlasov
2016-11-19 01:57
has joined #fabric-consensus-dev

jzhang
2016-11-19 05:25
@kostas @jyellick @garisingh not sure if this is supposed to work, but trying to hook up a network with kafka using the following docker-compose, getting this error:
```
orderer_1 | panic: runtime error: index out of range
orderer_1 |
orderer_1 | goroutine 16 [running]:
orderer_1 | panic(0x8b9880, 0xc42000c0a0)
orderer_1 |     /opt/go/src/runtime/panic.go:500 +0x1a1
orderer_1 | github.com/hyperledger/fabric/orderer/kafka.(*brokerImpl).GetOffset(0xc4201e4550, 0xc4201444c8, 0xc42004b814, 0xc42004b850, 0xc420226088)
orderer_1 |     /opt/gopath/src/github.com/hyperledger/fabric/orderer/kafka/broker.go:54 +0xc4
orderer_1 | github.com/hyperledger/fabric/orderer/kafka.(*clientDelivererImpl).getOffset(0xc420011020, 0xfffffffffffffffe, 0x0, 0x0, 0x0)
orderer_1 |     /opt/gopath/src/github.com/hyperledger/fabric/orderer/kafka/client_deliver.go:205 +0x1f8
orderer_1 | github.com/hyperledger/fabric/orderer/kafka.(*clientDelivererImpl).processSeek(0xc420011020, 0xc4201444c0, 0x0, 0x1)
orderer_1 |     /opt/gopath/src/github.com/hyperledger/fabric/orderer/kafka/client_deliver.go:162 +0x2a3
orderer_1 | github.com/hyperledger/fabric/orderer/kafka.(*clientDelivererImpl).sendBlocks(0xc420011020, 0xba9e20, 0xc42020c0d0, 0xba9e20, 0xc42020c0d0)
orderer_1 |     /opt/gopath/src/github.com/hyperledger/fabric/orderer/kafka/client_deliver.go:103 +0x56b
orderer_1 | github.com/hyperledger/fabric/orderer/kafka.(*clientDelivererImpl).Deliver(0xc420011020, 0xba9e20, 0xc42020c0d0, 0xc420011020, 0x0)
orderer_1 |     /opt/gopath/src/github.com/hyperledger/fabric/orderer/kafka/client_deliver.go:66 +0x79
orderer_1 | github.com/hyperledger/fabric/orderer/kafka.(*delivererImpl).Deliver(0xc4200c10e0, 0xba9e20, 0xc42020c0d0, 0x0, 0x0)
orderer_1 |     /opt/gopath/src/github.com/hyperledger/fabric/orderer/kafka/deliver.go:54 +0xfd
orderer_1 | github.com/hyperledger/fabric/orderer/kafka.(*serverImpl).Deliver(0xc4200c1100, 0xba9e20, 0xc42020c0d0, 0xc4200254a8, 0xc4200254d0)
orderer_1 |     /opt/gopath/src/github.com/hyperledger/fabric/orderer/kafka/orderer.go:56 +0x48
orderer_1 | github.com/hyperledger/fabric/protos/orderer._AtomicBroadcast_Deliver_Handler(0x8c6100, 0xc4200c1100, 0xba8800, 0xc420084880, 0xc42011f2f0, 0x0)
orderer_1 |     /opt/gopath/src/github.com/hyperledger/fabric/protos/orderer/ab.pb.go:477 +0xbb
orderer_1 | github.com/hyperledger/fabric/vendor/google.golang.org/grpc.(*Server).processStreamingRPC(0xc42007e240, 0xba9700, 0xc4201c2360, 0xc4200c42d0, 0xc42011ef00, 0xbc34c0, 0xc42011f2c0, 0x0, 0x0)
orderer_1 |     /opt/gopath/src/github.com/hyperledger/fabric/vendor/google.golang.org/grpc/server.go:657 +0x6f3
orderer_1 | github.com/hyperledger/fabric/vendor/google.golang.org/grpc.(*Server).handleStream(0xc42007e240, 0xba9700, 0xc4201c2360, 0xc4200c42d0, 0xc42011f2c0)
orderer_1 |     /opt/gopath/src/github.com/hyperledger/fabric/vendor/google.golang.org/grpc/server.go:741 +0xc33
orderer_1 | github.com/hyperledger/fabric/vendor/google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc4201e4430, 0xc42007e240, 0xba9700, 0xc4201c2360, 0xc4200c42d0)
orderer_1 |     /opt/gopath/src/github.com/hyperledger/fabric/vendor/google.golang.org/grpc/server.go:402 +0xab
orderer_1 | created by github.com/hyperledger/fabric/vendor/google.golang.org/grpc.(*Server).serveStreams.func1
orderer_1 |     /opt/gopath/src/github.com/hyperledger/fabric/vendor/google.golang.org/grpc/server.go:403 +0xa3
```


kostas
2016-11-19 14:36
@jzhang: Looking into this now.

kostas
2016-11-19 14:39
@jzhang: I'll need details into how exactly you got this error though, as far as I can tell, the latest master builds and runs fine. What is the Deliver request that you're issuing?

jzhang
2016-11-19 20:00
@kostas I’m running fabric-sdk-node/test/unit/end-to-end.js against the network set up above

yacovm
2016-11-20 13:13
Anyone home?

jyellick
2016-11-20 15:24
@yacovm Did you need something?

yacovm
2016-11-20 15:24
`// Creator of the message, specified as a certificate chain`

yacovm
2016-11-20 15:24
that's in SignatureHeader

yacovm
2016-11-20 15:24
what is the real run time type of the bytes field?

yacovm
2016-11-20 15:25
```
message SignatureHeader {
    // Creator of the message, specified as a certificate chain
    bytes creator = 1;
    // Arbitrary number that may only be used once. Can be used to detect replay attacks.
    bytes nonce = 2;
}
```

yacovm
2016-11-20 15:25
in case of multiple orderers

jyellick
2016-11-20 15:25
It is. We should probably update the text of this, essentially this is an 'identity' which the MSP will be able to evaluate.

jyellick
2016-11-20 15:25
It may be a certificate chain, or, it may be some other things the MSP knows how to evaluate

yacovm
2016-11-20 15:25
This is exactly what I don't understand — how do you put multiple PBFT orderers there, in 1 identity?

jyellick
2016-11-20 15:27
Oh

jyellick
2016-11-20 15:28
`SignatureHeader` is only intended to support a single identity

jyellick
2016-11-20 15:28
You'll see for instance in `SignedConfigurationItem` ultimately, there is a repeated section of `SignatureHeader`s embedded which is used for multi-sigs

yacovm
2016-11-20 15:29
I did a grep -ri and didn't find SignedConfigurationItem

yacovm
2016-11-20 15:30
where is it?

jyellick
2016-11-20 15:30
The `Envelope` message is the outermost wrapping message for transactions going into the system, and its `creator` field should always be the identity which submitted the message. If the message requires additional signatures, (say endorsements), then this should be done internally.

jyellick
2016-11-20 15:30
`fabric/protos/common/configuration.proto`

jyellick
2016-11-20 15:31
```
$ grep -ril SignedConfigurationItem .
./orderer/common/configtx/configtx_test.go
./orderer/common/bootstrap/static/static.go
./orderer/orderer
./bddtests/common/configuration_pb2.py
./protos/common/configuration.pb.go
./protos/common/common.pb.go
./protos/common/configuration.proto
```

yacovm
2016-11-20 15:31
```
yacovm@yacoVM ~/OBC/shared/gopath/src/github.com/hyperledger/fabric (commCertLearn)
$ grep -ri "SignedConfigurationItem" *
yacovm@yacoVM ~/OBC/shared/gopath/src/github.com/hyperledger/fabric (commCertLearn)
$
```

yacovm
2016-11-20 15:32
hmm your command finds it


jyellick
2016-11-20 15:34
You'll see that a `SignedConfigurationItem` embeds a repeated section of `ConfigurationSignature`, each of which contains a signature as bytes, and a `SignatureHeader` (as marshaled bytes)

yacovm
2016-11-20 15:34
yeah I found it, now trying to figure out where is that `SignedConfigurationItem ` is being put

jyellick
2016-11-20 15:35
I was just giving an example of using `SignatureHeader` to express multiple identities (by embedding it multiple times)

yacovm
2016-11-20 15:35
who references it? I don't see any pb.go apart from `configuration.pb.go` referencing it

jyellick
2016-11-20 15:36
Per my grep above, you can find it being used in
```
./orderer/common/configtx/configtx_test.go
./orderer/common/bootstrap/static/static.go
```

yacovm
2016-11-20 15:37
oh I understand, you're explaining to me the "policy", right?

jyellick
2016-11-20 15:37
More generic configuration, it was just an example of `SignatureHeader` and multiple identities.

yacovm
2016-11-20 15:38
But what I am interested to know is: I get a block multi-signed, how do I validate it now? all I have is what is found in common.proto

yacovm
2016-11-20 15:38
right?

jyellick
2016-11-20 15:39
Sure, so, if you've got a multi-signed block, you'll ask the `policies.Manager` for the block validation policy, then pass the messages, set of signatures, and identities into the `Policy` for validation, and it will validate, or not.
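A rough sketch of that validation flow. The `policy` interface and `signedData` struct here are stand-ins loosely modeled on the description above, not fabric's actual `policies.Policy` API, and the `atLeastN` policy is a toy that only counts signatures instead of verifying them via the MSP:

```go
package main

import (
	"errors"
	"fmt"
)

// signedData pairs an identity, its signature, and the signed bytes.
type signedData struct {
	Identity  []byte
	Signature []byte
	Data      []byte
}

// policy decides whether a set of signatures satisfies it.
type policy interface {
	Evaluate(sigs []signedData) error
}

// atLeastN is a toy policy: valid if at least n signatures are present.
// A real block-validation policy would verify each signature via the MSP.
type atLeastN struct{ n int }

func (p atLeastN) Evaluate(sigs []signedData) error {
	if len(sigs) < p.n {
		return errors.New("not enough signatures")
	}
	return nil
}

func main() {
	// Ask the (hypothetical) policy manager for the block validation
	// policy, then hand it the collected signatures over the block.
	var blockPolicy policy = atLeastN{n: 2}
	sigs := []signedData{
		{Identity: []byte("orderer1"), Signature: []byte("sig1"), Data: []byte("hdr")},
		{Identity: []byte("orderer2"), Signature: []byte("sig2"), Data: []byte("hdr")},
	}
	fmt.Println(blockPolicy.Evaluate(sigs)) // <nil>
}
```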

yacovm
2016-11-20 15:40
but I only have 1 message, don't i?
```
// Payload is the message contents (and header to allow for signing)
message Payload {
    // Header is included to provide identity and prevent replay
    Header header = 1;
    // Data, the encoding of which is defined by the type in the header
    bytes data = 2;
}
```

yacovm
2016-11-20 15:40
this is encapsulated in
```
// Envelope wraps a Payload with a signature so that the message may be authenticated
message Envelope {
    // A marshaled Payload
    bytes payload = 1;
    // A signature by the creator specified in the Payload header
    bytes signature = 2;
}
```

jyellick
2016-11-20 15:47
We have not yet defined the block signature structure. We could define a simple proto like:
```
message BlockSignature {
    bytes signatureHeader = 1;
    // The signature over the concatenation of the block header hash and the signature header bytes
    bytes signature = 2;
}
```
Then, define a new envelope header type of `BLOCK_SIGNATURE` which embeds a repeated section of the `BlockSignature` as the payload. Would want to run this scheme by the crypto folks, but I imagine it would work.

yacovm
2016-11-20 15:48
oh I see. I thought it was already inside so I didn't understand what I'm missing.

yacovm
2016-11-20 16:19
was distracted by #fabric-dev , thanks a lot for the explanations @jyellick !


simon
2016-11-21 09:12
hi

niubwang
2016-11-21 10:59
hi guys, does anyone know the TPS of fabric? the current version and version 1.0

grapebaba
2016-11-21 11:29
guys, can anyone help explain more about the paragraph 'As the endorser nodes responsible for particular chaincode are orthogonal to the consenters, the system may scale better than if these functions were done by the same nodes'

jyellick
2016-11-21 14:17
@grapebaba In the old system, the nodes that did consensus (ordering) were the same ones that did chaincode execution. In the new system, endorsers run chaincode, and orderers perform consensus to achieve ordering.

simon
2016-11-21 15:01
no scrum?

tuand
2016-11-21 15:01
scrum !


hgabor
2016-11-21 16:31
could anybody try to run this?


hgabor
2016-11-21 16:31
I get errors like this: 2016/11/21 15:11:08 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp :6016: getsockopt: connection refused"; Reconnecting to {

hgabor
2016-11-21 16:31
some grpc stuff keeps running in the background

hgabor
2016-11-21 16:31
I need help :S :S

hgabor
2016-11-21 16:32
I think I am closing all the connections after test cases

jamesjong
2016-11-21 19:26
has joined #fabric-consensus-dev

jzhang
2016-11-21 21:38
@jyellick @tuand the latest orderer image fails to start like this:

kostas
2016-11-22 01:24
@jzhang: Still not sure why we're not exploring the Kafka route? I saw this comment about Kafka crashing (http://gerrit.hyperledger.org/r/2657) but it lacks a stack trace, context, and details on how to reproduce. We could have easily spent the day today debugging this.

jzhang
2016-11-22 01:28
@kostas sorry didn’t get a chance to reach out today, was battling a number of other areas. you can use the docker-compose that i posted above to start the 2+1 network and use peer command to submit a deploy request


nits7sid
2016-11-22 06:09
Hi.. In the upcoming architecture, can the endorsing peers scale to 3k–5k? If not, what is the maximum number of endorsing peers that can participate?

hgabor
2016-11-22 10:29
please help us by reviewing this chain: https://gerrit.hyperledger.org/r/#/c/2517/

hgabor
2016-11-22 10:29
very urgent and important

simon
2016-11-22 11:11
@hgabor so how do i replicate that test failure?

simon
2016-11-22 11:14
why is grpc on two ports?

simon
2016-11-22 11:16
and the genesis block hashes are still different

simon
2016-11-22 11:17
am i looking at the wrong patchset?

hgabor
2016-11-22 11:22
maybe I forgot to commit the genesis thing. orderer/common/bootstrap/static/static.go should be changed at line 44: `chainID, err := primitives.GetRandomBytes(16)` — you should put a 16-byte const there

simon
2016-11-22 11:22
can you please commit what you have

hgabor
2016-11-22 11:22
what two ports? one for AB, one for consensus

hgabor
2016-11-22 11:22
yes

simon
2016-11-22 11:22
why two ports then?

simon
2016-11-22 11:22
both can be on the same grpc

hgabor
2016-11-22 11:23
I was following the ideas of your original 'main app' and decided to have two ports. if that is possible, we can have one

simon
2016-11-22 11:23
because the grpc connect failures are to the grpc port

simon
2016-11-22 11:23
whatever that may be

simon
2016-11-22 11:23
ah not only?

simon
2016-11-22 11:23
what is going on

hgabor
2016-11-22 11:24
you mean?

simon
2016-11-22 11:24
2016/11/22 12:13:05 could not connect to replica 1 (79bb48 [:6004]): grpc: timed out when dialing

simon
2016-11-22 11:24
2016/11/22 12:13:05 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp :6003: getsockopt: connection refused"; Reconnecting to {":6003" <nil>}

simon
2016-11-22 11:24
what's going on, why timed out?

simon
2016-11-22 11:24
why connection refused?

hgabor
2016-11-22 11:26
my theory is that if you run two tests one after another, one with X nodes and the other with Y (X > Y), then some grpc.Dial goroutines from connectWorker remain and try to connect to non-existing nodes

simon
2016-11-22 11:26
aha

hgabor
2016-11-22 11:26
but 6003 is the port of node number 1

hgabor
2016-11-22 11:26
sorry I meant replica


simon
2016-11-22 11:36
look at what goroutines are still around at the end of TestTwoReplicasBroadcastAndDeliverUsingTheSame

simon
2016-11-22 11:37
unless all of this cleans up, i'd expect more test failures

simon
2016-11-22 11:38
probably it is more practical to keep the receiver in the test process and run the replicas in separate processes

hgabor
2016-11-22 11:42
maybe we could analyze that log

hgabor
2016-11-22 11:42
I mean we should concentrate on goroutines which are related to our code

hgabor
2016-11-22 11:43
e.g. this is not: /usr/lib/go/src/testing/testing.go:583 +0x8d2 it is the test framework I guess

garisingh
2016-11-22 12:16
`TestTwoReplicasBroadcastAndDeliverUsingTheSame` has a logic error - see my comment on patch 9 but basically you need to modify the broadcast call to use peer 1 not peer 9

garisingh
2016-11-22 12:17
right now, the first 2 tests pass for me but the "Bomb" test fails

simon
2016-11-22 12:18
yea

garisingh
2016-11-22 12:19
for the Bomb test do you see that the test fails with 10 messages received every time? I guess it's supposed to be 12

simon
2016-11-22 12:19
there are some other issues that we need to address for testing

garisingh
2016-11-22 12:20
yes - there are background routines (as you mentioned) which continue to run

simon
2016-11-22 12:21
yea we're changing it to use exec + kill

garisingh
2016-11-22 12:23
cool - just thought i'd take a quick look since everyone else ignored Gabor :disappointed:

garisingh
2016-11-22 12:23
seems like you guys have it under control

simon
2016-11-22 12:28
we're working on it

simon
2016-11-22 12:28
somehow the conversation moved away from the channel

niubwang
2016-11-22 12:34
hi guys, does anyone know the TPS of fabric?

niubwang
2016-11-22 12:35
100? 10000? or more?

jonathanlevi
2016-11-22 12:35
You know we can’t really ignore @hgabor… he always finds a way to reach out :wink:

jonathanlevi
2016-11-22 12:36
@niubwang It depends on your set up. Less than 10K TPS last time I checked (with Fabric v0.6)

jonathanlevi
2016-11-22 12:36
But whatever number people will give you, v1.0 has so many changes that will have a HUGE impact on performance.

jonathanlevi
2016-11-22 12:37
(the biggest is the “pluggable architecture” where you can specify/switch components using configuration)

jonathanlevi
2016-11-22 12:37
FWIW, I was able to show a stable 2K TPS flow by disabling/simplifying a lot.

simon
2016-11-22 12:37
niubwang: 100-1000

simon
2016-11-22 12:38
depending on the complexity of the chaincode

jonathanlevi
2016-11-22 12:38
Yes, and other factors (e.g., the number of validators, complexity of the TCerts, etc.)

jonathanlevi
2016-11-22 12:39
It is easy (at this point) to “play” with these numbers. Just being frank.

niubwang
2016-11-22 12:41
thanks simon and jonathanlevi

niubwang
2016-11-22 12:41
and the transaction Latency?

niubwang
2016-11-22 12:43
does it depending on the pbft configuration?

simon
2016-11-22 12:45
yes it does depend on that, but not much

simon
2016-11-22 12:45
well, batch size influences it

niubwang
2016-11-22 12:51
the bigger the batch timeout, the higher the TPS and the higher the latency — is that right?

simon
2016-11-22 12:52
yes, in theory

simon
2016-11-22 12:53
i think 100 or 500 batch size already reaches top tps

niubwang
2016-11-22 13:01
the batch size set to 100 or 500 and the batch time out set to 1s or 2s?

simon
2016-11-22 13:02
batch timeout is just to keep the system going if there are not many requests

simon
2016-11-22 13:02
i think 1s or so is reasonable, but it really depends on your requirements

niubwang
2016-11-22 13:05
thanks simon

hmhem
2016-11-23 03:00
has joined #fabric-consensus-dev

ankitkamra
2016-11-23 07:55
has joined #fabric-consensus-dev

hgabor
2016-11-23 14:27
kind of 'next level' stress tests for sbft: https://gerrit.hyperledger.org/r/#/c/2515/13

sanchezl
2016-11-23 15:13
@sanchezl uploaded a file: https://hyperledgerproject.slack.com/files/sanchezl/F35KVP9EV/jim-env.tgz and commented: @jzhang , I was unable to reproduce your issue. Here is a copy of the environment I used and the commands I ran.

muralisr
2016-11-23 19:32
something in the orderer is broken recently. it's receiving transactions but appears to be dropping them randomly… by random, I mean the same transaction content doesn't always trigger a cut/deliver block…. looking at it.

muralisr
2016-11-23 19:32
if anyone has ideas, please do suggest

muralisr
2016-11-23 21:14
Can someone verify this fix https://gerrit.hyperledger.org/r/#/c/2741/ for the above please ? ( @jyellick , @kostas, @hgabor ?)

hgabor
2016-11-23 21:14
Looking

muralisr
2016-11-23 21:15
thanks @hgabor

hgabor
2016-11-23 21:17
So the timer should be nil-ed every time right?

muralisr
2016-11-23 21:17
I think so… and it did work

muralisr
2016-11-23 21:17
also looked at some old code and it was getting niled there

muralisr
2016-11-23 21:18
specifically

muralisr
2016-11-23 21:18
```
cutBatch := func() {
	bs.rl.Append(curBatch, nil)
	curBatch = nil
	timer = nil
}
```

muralisr
2016-11-23 21:18
from the previous version...

muralisr
2016-11-23 21:19
gets called each time the timer pops

muralisr
2016-11-23 21:19
I think its the right thing to do … but then I need your eyes

hgabor
2016-11-23 21:20
But after it is nil, it must be reinitialized, because receive and send on a nil channel always block

muralisr
2016-11-23 21:22
I think the only initialization happens here


muralisr
2016-11-23 21:22
the first batch always gets sent … ie, timer starts out as nil

muralisr
2016-11-23 21:22
and look at that check

muralisr
2016-11-23 21:23
it basically depends upon timer to be nil to start one

hgabor
2016-11-23 21:27
Why len(batches) == 0

hgabor
2016-11-23 21:27
The only thing I don't get yet

muralisr
2016-11-23 21:32

muralisr
2016-11-23 21:33
the first message to be ordered returns “nil, true” which signals that a new batch is getting started

hgabor
2016-11-23 21:33
ok in the meanwhile I had a look at the code and found it out so I think I buy it

muralisr
2016-11-23 21:33
right

muralisr
2016-11-23 21:34
and the select on a nil channel is basically a noop I think

muralisr
2016-11-23 21:34
(it better be :slightly_smiling_face: )

hgabor
2016-11-23 21:34
nil operations block but select does not select that case - I guess

muralisr
2016-11-23 21:34
ok. that might be too

hgabor
2016-11-23 21:35
wait, sorry, one more thing


hgabor
2016-11-23 21:35
line 81

muralisr
2016-11-23 21:36
yes ?

hgabor
2016-11-23 21:36
what happens if // If the message is a valid normal message and does not fill the batch, nil, true is returned

hgabor
2016-11-23 21:36
if this is the case

hgabor
2016-11-23 21:37
nil, true

hgabor
2016-11-23 21:37
len(nil) == 0 ?

muralisr
2016-11-23 21:37
I think nil,true is returned ONLY if it is the first message in the batch ?

muralisr
2016-11-23 21:38
no. I’ll rephrase

hgabor
2016-11-23 21:38
I don't know, I only know what Jason's comments say here: https://gerrit.hyperledger.org/r/#/c/2587/3/orderer/common/blockcutter/blockcutter.go

muralisr
2016-11-23 21:38
right

muralisr
2016-11-23 21:38
that’s what I was going to rephrase …

muralisr
2016-11-23 21:39
len(nil) == 0, yes

hgabor
2016-11-23 21:39
btw if v is nil, len(v) is zero.

hgabor
2016-11-23 21:39
from godocs

hgabor
2016-11-23 21:39
okaaay, sorry

muralisr
2016-11-23 21:39
basically it says keep returning nil and batch up internally

muralisr
2016-11-23 21:39
right

muralisr
2016-11-23 21:39
no worries

muralisr
2016-11-23 21:40
it took me a bit going back and forth too

hgabor
2016-11-23 21:40
so yeah, the timer needs to be "cleared" (set to nil)

hgabor
2016-11-23 21:40
timer channel

muralisr
2016-11-23 21:40
right, seems that’s all needs to be done

hgabor
2016-11-23 21:42
+2 given

hgabor
2016-11-23 21:43
btw isn't there any test for this?

muralisr
2016-11-23 21:50
thanks! we need to check with @jyellick

hgabor
2016-11-23 21:56
we should have a test for this

jonathanlevi
2016-11-23 21:57
It is preferable/desirable to have all these various “paths” covered somehow with unit tests, so that the full “state machine” is being visited.

jonathanlevi
2016-11-23 21:58
It may be difficult to reproduce (at first), but long term, the investment will pay itself pretty quickly.

jonathanlevi
2016-11-23 21:58
Having said that, I have merged it, in the meantime.

jyellick
2016-11-23 22:10
Sorry, just catching up on this

jyellick
2016-11-23 22:10
Yes, timer should be nil-ed after it pops, sorry for the bug

jyellick
2016-11-23 22:10
I can write up a test for this later tonight

muralisr
2016-11-23 22:19
no worries, Jason

muralisr
2016-11-23 22:19
lets chalk it up to any bug of mine waiting out there :wink:

cbf
2016-11-23 23:11
@jyellick a test would be welcomed

stylix
2016-11-24 03:26
Hi, I have some questions about accessing consensus state. Currently, I check whether the consensus process is done by checking that the chain height and the currentBlockHash are the same on all nodes. The question is, can I read the consensus state by using the API/SDK? With this, I will know if I am sending transactions too fast, and then be able to throttle the transaction speed.

jyellick
2016-11-24 07:24
@cbf @muralisr As promised, please see https://gerrit.hyperledger.org/r/#/c/2749/ . There was one additional case where the timer was being stopped by `cutBlock` in the old code which got omitted in the refactored code, so I added a fix and a test case for that as well.

jonathanlevi
2016-11-24 07:38
@jyellick: I dread asking what’s your local time!

jonathanlevi
2016-11-24 07:39
Thank you… looks great. Will approve once the tests complete.

hgabor
2016-11-24 08:10
@jyellick I will have a look soon

hgabor
2016-11-24 08:28
done!

drichard
2016-11-28 03:34
has joined #fabric-consensus-dev

drichard
2016-11-28 03:37
Hi - in https://github.com/hyperledger/fabric/blob/master/proposals/r1/Next-Consensus-Architecture-Proposal.md the framework allows for pluggable consensus implementations. Which ones will be included by default? Just PBFT?

hgabor
2016-11-28 08:07
as far as I know (not the ultimate truth) we are currently implementing simple bft (simplified pbft) and have solo (simple orderer) and kafka. we will also implement a pipelined version of sbft.

hgabor
2016-11-28 09:20
I added some tests to this: https://gerrit.hyperledger.org/r/#/c/2673/

tuand
2016-11-28 14:59
scrum ...

claytonsims
2016-11-28 15:00
hangout?

gennady.laventman
2016-11-28 15:00
link?

2016-11-28 15:00
@tuand has started a Google+ Hangout for this channel. https://hangouts.google.com/hangouts/_/bwahgnabtfh3pi42d32kmvjygqe.

gennady.laventman
2016-11-28 15:01
@yacovm - I will talk with you later

hgabor
2016-11-28 15:04
there's a meeting going on so I would prefer not using my mic, but I'll join with audio only :slightly_smiling_face: with @vukolic we are (still) hunting sbft bugs. @vukolic added some improvements for sbft; they have already been merged.

hgabor
2016-11-28 15:06
@tuand

tuand
2016-11-28 15:08
got it

vukolic
2016-11-28 15:09
sorry I cannot join - on another call

vukolic
2016-11-28 15:09
@hgabor summarized it well

hgabor
2016-11-28 15:09
@tuand sorry for not answering but listening :slightly_smiling_face:

tuand
2016-11-28 15:13
np @hgabor ! between scrum and this channel, we're good

hgabor
2016-11-28 15:14
btw please have a look at this CR: https://gerrit.hyperledger.org/r/#/c/2673/ I added tests as Chris suggested

adc
2016-11-28 15:18
Hi All, apart sbft, who else is generating signatures in the consensus package?

hgabor
2016-11-28 15:19
@jyellick may know something about that

jyellick
2016-11-28 15:31
@adc Solo and Kafka will each attach a signature to each block created, attesting to its validity

jyellick
2016-11-28 15:33
Because Solo and Kafka are both not BFT, a single signature is all that is needed, whereas with SBFT we will need f+1 (though we will likely have and encode 2f+1) signatures attesting to the block's validity

vukolic
2016-11-28 15:38
makes sense

adc
2016-11-28 15:38
I see, the code for signing is already in place, I guess. Right?

adc
2016-11-28 15:46
So, @jyellick, the only sign calls I see are in the sbft package

jyellick
2016-11-28 15:47
Ah, so, I was giving an 'eventually' statement. Today, there's no signing or signature validation done anywhere (outside of sbft)

jyellick
2016-11-28 15:47
We need to hook in the BCCSP code and start utilizing it. We've left some (probably not adequately documented) plugpoints for it, but today, other than sbft, we don't do signing.

adc
2016-11-28 15:48
got it. Actually, one should hook the MSP directly

adc
2016-11-28 15:48
the MSP will then know what to do with the BCCSP

jyellick
2016-11-28 15:49
Ah, okay. So how do I know which MSP to pick?

adc
2016-11-28 15:49
really good question :slightly_smiling_face:

adc
2016-11-28 15:49
we need a configuration, actually one can follow what has been done for the peer

adc
2016-11-28 15:50
which is still a work in progress but is a first step any way

jyellick
2016-11-28 15:50
My impression had been, that for instance when I get a signature, I can simply hand the signing identity, the signature, and the bytes the signature was over to 'something' and get back the validity

adc
2016-11-28 15:50
So, the peer initializes the MSPManager. The assumption, for now, is that there is only one MSP with a default identity

adc
2016-11-28 15:51
yes, that's correct

adc
2016-11-28 15:51
actually, what would be nice to have is something like @yacovm has done for the gossip module

adc
2016-11-28 15:52
That is to have an intermediate interface that exposes Sign and Verify plus some other methods that are needed in the specific case

adc
2016-11-28 15:52
and have us implementing it using the MSP

adc
2016-11-28 15:52
in this way, the orderers would be kind of pluggable also with respect to how signatures are generated and verified. Or if you want, kind of independent from the MSP

adc
2016-11-28 16:17
I would also like to ask about the status of the configuration, if possible. https://gerrit.hyperledger.org/r/#/c/2677/ needs review

weeds
2016-11-28 16:37
Quick chat with Tuan- they are working on getting multi-channel working with Kafka

tuand
2016-11-28 17:13
@kostas @jyellick need you guys to comment on changeset https://hyperledgerproject.slack.com/archives/fabric-consensus-dev/p1480349822004450

tuand
2016-11-28 17:14
everyone else is welcome to review as well, but kostas/jason/me need to review in the next couple days

jyellick
2016-11-28 17:24
Looking now

kostas
2016-11-28 19:02
@sanchezl @jzhang Where are we with the issue that Jim reported seeing last week?

kostas
2016-11-28 19:04

sanchezl
2016-11-28 19:04
I was unable to reproduce. I posted my environment here.

kostas
2016-11-28 19:05
@sanchezl: Right, I remember that. I am checking to see if there was a follow-up.

grapebaba
2016-11-29 10:24
guys, on v0.6

grapebaba
2016-11-29 10:24
[consensus/pbft] recvViewChange -> WARN e845 Replica 2 already has a view change message for view 1 from replica 1
[consensus/pbft] recvViewChange -> WARN 15af5 Replica 1 already has a view change message for view 1 from replica 1

grapebaba
2016-11-29 10:25
what cause this warning

grapebaba
2016-11-29 10:26
how will this disappear?

hgabor
2016-11-29 10:31
is that a fatal problem if a replica has multiple view changes from another? @vukolic

hgabor
2016-11-29 10:31
I don't think

vukolic
2016-11-29 10:31
it is not a problem

vukolic
2016-11-29 10:31
in principle view change msgs need to be retransmitted

vukolic
2016-11-29 10:32
@jyellick may remember more of the actual implementation

garisingh
2016-11-29 12:22
looking at https://gerrit.hyperledger.org/r/#/c/2673/ ..... (and I probably missed this somewhere in a doc) couple things I am wondering about: 1) Generally speaking, DeliverResponse sends a Block and I believe that currently neither Solo nor Kafka orderer "sign" this block 2) Block contains a metadata field - is this where the "proof" for the Block is supposed to go - i.e. is this where the signature info would actually go? 3) In the case of the CFT-based orderers, my assumption is that you'd only need one signature from the "shim" which "delivers" the message. Correct? 4) Specific to 2673, for SBFT is the "proof" actually multiple signatures (e.g. 2f+1 SBFT nodes)? Or I am just totally off here?

hgabor
2016-11-29 12:33
2) I think yes. if it should not be there then it could be accepted as a temporary solution at least (as we have no better place yet) 3) in 2673 I only store multiple signatures (as you said, and I also store some kind of header which is sbft internal) - but yes it can be called "the proof"

garisingh
2016-11-29 12:38
(I think I picked up the "proof" term from something @jyellick used to say :wink: )

hgabor
2016-11-29 12:50
yeah and rawledger code also uses that term

jyellick
2016-11-29 14:15
@grapebaba When a replica believes the view should change, but has not received a new view message, it periodically (once a second) resends its view change message (because it might have been lost due to network failure). The warnings you see are benign and should probably be at a lower log level.

jyellick
2016-11-29 14:17
1) Correct, though this is in plan to add 2) Yes, and potentially some other info, for instance the gossip folks would like an attestation of the latest config block there 3) Yes 4) We technically only need f+1, though in reality we will likely have 2f+1 so will likely just include all

jyellick
2016-11-29 14:18
And yes, @garisingh is right, originally, we had a single byte field for 'proof', but, metadata was more generic and could be a superset of proof

garisingh
2016-11-29 14:22
thanks @jyellick - so don't we have a "minor" issue with SBFT as it is currently implemented in that what gets passed around is a "batch" and not a "block" - meaning the current SBFT signatures are not actually on the Block itself? (I think that's what your comment was getting at for 2673?)

jyellick
2016-11-29 14:23
Correct, sbft needs to be converted wholesale to the common data structures

garisingh
2016-11-29 14:23
e.g SBFT batch -> cb.Block?

jyellick
2016-11-29 14:23
They're quite similar to the sbft structures, so I don't think this is an impossible task, but it will obviously be invasive

jyellick
2016-11-29 14:23
Right

garisingh
2016-11-29 14:23
cool. just trying to keep up :wink:

elli
2016-11-29 14:32
@tuand, @garisingh, @jyellick a few more changesets were submitted here to simplify the config files


elli
2016-11-29 14:33
Also a more visual representation of the peer init config schema is included in the attachment:


garisingh
2016-11-29 14:36
@jyellick @hgabor - shall we approve 2673 and then do another changeset to retrofit with the common structures? trying to stay within the spirit of incremental changes :wink:

jyellick
2016-11-29 14:37
@garisingh You can see that is my opinion (per the +2)

garisingh
2016-11-29 14:37
oh

garisingh
2016-11-29 14:38
you were a step ahead of me :wink:

hgabor
2016-11-29 14:38
we are currently debugging sbft with @vukolic - once the network tests begin to work (https://gerrit.hyperledger.org/r/#/c/2515/) I will try to use the common structures in sbft

hgabor
2016-11-29 14:39
and there are dozens of other things to do, solo and kafka are several miles ahead of sbft

hgabor
2016-11-29 14:39
in the usage of common structures and functionalities

garisingh
2016-11-29 14:39
no worries

hgabor
2016-11-29 14:39
e.g. cutter, manager and I don't remember :smile:

vukolic
2016-11-29 14:40
of course they are more advanced - since we are building a spaceship and not a bicycle/car

vukolic
2016-11-29 14:40
:wink:

hgabor
2016-11-29 14:40
@jyellick promised that he will take a photo of the blackboard next time if he draws a diagram about the (planned) system

jyellick
2016-11-29 14:42
Yes, will do. Very close to having an end to end flow of multi-chain including chain creation, so trying to stay heads down on the code but will be sure to document when finished.

hgabor
2016-11-29 14:45

hgabor
2016-11-29 14:55
pls +2 it again

garisingh
2016-11-29 14:56
@hgabor - will do

hgabor
2016-11-29 14:58
@jyellick one more please

jyellick
2016-11-29 15:03
@hgabor Done

hgabor
2016-11-29 15:03
thx

anton
2016-11-29 15:07
has joined #fabric-consensus-dev

grapebaba
2016-11-29 15:16
@jyellick: when can the resend terminate?

kostas
2016-11-29 16:59
@grapebaba: When/if the network eventually switches to a new view that is equal to or higher than the one the complaining peer asks for.

kostas
2016-11-29 17:01
If the rest of the network operates just fine in an "earlier" view, that other peer will be resending forever.

jyellick
2016-11-29 18:36
@grapebaba See https://jira.hyperledger.org/browse/FAB-707 for a more thorough discussion

divyank-sk
2016-11-29 18:51
has joined #fabric-consensus-dev

grapebaba
2016-11-30 01:06
thanks @jyellick @kostas

lovesh
2016-11-30 06:44
has joined #fabric-consensus-dev

adc
2016-11-30 08:06
Hi @kostas, is the kafka package supposed to generate any signature?

adc
2016-11-30 08:06
I have also noticed that solo doesn't generate any signature too

vukolic
2016-11-30 08:39
@adc solo/kafka will both generate a signature as this is needed for gossip

vukolic
2016-11-30 08:39
I do not think they do yet - but they will

adc
2016-11-30 08:39
okay, I was looking for that information to understand how to integrate the MSP

adc
2016-11-30 08:40
so far, only sbft generates signatures

adc
2016-11-30 08:40
but in this change-set https://gerrit.hyperledger.org/r/#/c/2605/, related to orderer bootstrapping, sbft cannot be chosen

garisingh
2016-11-30 10:12
@adc - I don't think we want to integrate MSP with the ordering nodes until we get it straightened out on the peer side. But I agree the first place to start should be with signing blocks as this SHOULD be very similar to providing peer identity and signing endorsement responses. As a matter of fact, other than the fact they sign something different, not sure why things would not be almost identical

elli
2016-11-30 13:59
+1

kostas
2016-11-30 14:02
@adc Marko is right. We should also be signing but don't do it yet.

adc
2016-11-30 14:05
perfect. So, when time will come let's coordinate on this :slightly_smiling_face:

vukolic
2016-11-30 14:24
attn maintainers


hgabor
2016-11-30 14:32
grpc advice needed:

hgabor
2016-11-30 14:33
if I have a GRPC Server Implementation with a function A() and A() is called from client side (from a grpc client, client.A() ) my server code for A() is called. is that done in the main goroutine?

yacovm
2016-11-30 14:35
of course not

yacovm
2016-11-30 14:36
that would mean that you could only process 1 invocation of gRPC service serially, because your statement will hold for any invocation of A()

yacovm
2016-11-30 14:37
You can easily, btw know the goroutine it is called from. Go to `gossip/util/misc.go` -> PrintStackTrace and change *true* to *false*, and then in the gRPC server-side method invoke util.PrintStackTrace() and print it and you'll see from where it's invoked

yacovm
2016-11-30 14:43
I checked for you, the goroutine that serves (at least a stream request) is created in the following way, in `grpc/server.go`:
```
go func() {
	defer wg.Done()
	s.handleStream(st, stream, s.traceInfo(st, stream))
}()
```

hgabor
2016-11-30 14:46
@yacovm thanks, that is what I thought too but somehow sbft related code reached a dead lock related to this and I am thinking what caused that

yacovm
2016-11-30 14:47
well for deadlock "debugging" you can print all goroutine(s) when you think there is a deadlock using util.PrintStackTrace()

yacovm
2016-11-30 14:47
it's in `gossip/util/misc.go`

yacovm
2016-11-30 14:47
it can be helpful because it sometimes shows you if a goroutine is waiting on a lock, or not and you perhaps can deduce which goroutines are waiting on which locks

yacovm
2016-11-30 14:52
I wish golang could have something like *jconsole* in java though... it's so comfortable

tuand
2016-11-30 19:13
@elli @adc @aso @jeffgarratt @muralisr @binh this is how I see the genesis block create tool working after our discussion this morning https://jira.hyperledger.org/browse/FAB-665?focusedCommentId=19911&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-19911

aso
2016-11-30 19:13
has joined #fabric-consensus-dev

tuand
2016-11-30 19:14
Let's continue using FAB-665 for comments


umasuthan
2016-12-01 09:59
Can someone provide clarification on how the Validating Leader is selected or elected during consensus ? Thank you

nits7sid
2016-12-01 10:27
Hi. I am using fabric v 0.6. I have set up a network of 4 peers and 1 CA. I am trying to connect the Non-Validating peer to the network, When i try to use the /chain Rest API on the NVP it gives me {"Error":"No blocks in blockchain."}

ankitkamra
2016-12-01 10:29
@nits7sid are all vp's and nvp's are on same network ?

nits7sid
2016-12-01 10:40
@ankitkamra yes

ankitkamra
2016-12-01 10:40
i am getting the same problem

ankitkamra
2016-12-01 10:41
what i am doubting is that every peer is listening on 172.17.0.2

ankitkamra
2016-12-01 10:41
they are not able to connect through physical ip of machine

ankitkamra
2016-12-01 10:42
so the nvp must forward invoke/query requests to a vp, but it tries to connect to 172.17.0.2, that is, it gets connected to itself

nits7sid
2016-12-01 10:42
what is the command you using to run a NVP?

ankitkamra
2016-12-01 10:42
peer node start

ankitkamra
2016-12-01 10:43
just change in peer/core.yaml, set value of validator=false

ankitkamra
2016-12-01 10:43
what about you

nits7sid
2016-12-01 10:43
same

ankitkamra
2016-12-01 10:44
so i am expecting this problem

ankitkamra
2016-12-01 10:44
what do you say ?

garisingh
2016-12-01 11:16
@nits7sid - a non-validating peer is really no more than a glorified "wallet" for submitting transactions to validating peers. moving forward, we don't even have non-validating peers in the v1 architecture. What is your goal in terms of using an NVP?

ankitkamra
2016-12-01 11:22
@garisingh in my case I want to connect a third-party system to an nvp, so that the third party may not be able to deploy chaincode but still has a replica of the data in his local network and can read it quickly

nits7sid
2016-12-01 11:23
@garisingh: my goal is to connect some peers to the network and get the blockchain data.

garisingh
2016-12-01 11:27
@ankitkamra - not going to work - the only way to query data is via chaincode and chaincode is not deployed to non-validating peers

garisingh
2016-12-01 11:28
@nits7sid - why not just get the blockchain data from an application using the SDK? Why does it need to be a NVP?

ankitkamra
2016-12-01 11:29
@garisingh one question from my side. Is this possible, we give permissions to particular users so that only they can deploy chaincode

ankitkamra
2016-12-01 11:29
means is there any access right management available ?

garisingh
2016-12-01 11:30
not in v0.6

nits7sid
2016-12-01 11:30
@garisingh: can I add new peers to the network dynamically in v 0.6?

ankitkamra
2016-12-01 11:30
then there may be security issues, in that if I give a peer to a third party, he may deploy chaincode

ankitkamra
2016-12-01 11:31
and read the data written by another chaincode too ?? am i right or not ?

garisingh
2016-12-01 11:31
@nits7sid - no - you cannot add validating peers dynamically in v0.6

garisingh
2016-12-01 11:32
@ankitkamra - you can build access control into your chaincode methods themselves to prevent certain clients from invoking or querying a specific chaincode or specific functions on that chaincode

nits7sid
2016-12-01 11:32
@garisingh: through SDK ?

ankitkamra
2016-12-01 11:33
@garisingh yes that we can do. suppose if i have written some data with chaincode1. can we read that data with chaincode2?

garisingh
2016-12-01 11:33
you can't really stop someone from deploying chaincode in fabric v0.6 without putting some type of "proxy" layer in between

ankitkamra
2016-12-01 11:33
with same key

garisingh
2016-12-01 11:35
@ankitkamra - the data itself is scoped at a chaincode level so if you restrict access to chaincode within your functions then you would not be able to invoke those functions from other chaincode unless you also had permission to access the initial chaincode

garisingh
2016-12-01 11:36
@nits7sid - sorry - had a typo - you CANNOT dynamically add validating peers to the network in v0.6

nits7sid
2016-12-01 11:39
@garisingh: I have a network of 5 VP's. I connected 4 peers first and then deployed a chaincode. Later I connected one more peer to the network. I noticed tht the new peer didn't sync the blocks. i am using v0.6

garisingh
2016-12-01 11:42
that's the expected behavior. there's a slight chance that if you stop your original 4 peers, modify the config to say that there are 5 peers, and restart, things *might* work, but v0.6 does not support dynamically adding a peer to a network

nits7sid
2016-12-01 11:43
@garisingh: I started initially wiht N=5 then started 4 peers initially

nits7sid
2016-12-01 11:45
@garisingh so the only way to add new peers is to stop the peers, change config and then restart them?

garisingh
2016-12-01 11:49
hmmm - if you specified N=5 to start with, you should be able to start up the 5th peer after the fact. But you'd have to keep running transactions for the 5th peer to notice it is behind and catch up. As I recall, you'd probably need to generate between 10 and 20 blocks of transactions

nits7sid
2016-12-01 11:52
@garisingh: so I should be generating 10-20 transactions into the network ?

garisingh
2016-12-01 12:00
something like that - I forgot the default block size and timeout, but I think its something low like 2 transactions / block?

nits7sid
2016-12-01 12:04
@garisingh: thanks..i wil try that out

ankitkamra
2016-12-01 13:22
@garisingh thanks for your support

umasuthan
2016-12-01 13:44
@garisingh, You mentioned in your earlier post that only way to query data is via chaincode and chaincode is not deployed to non-validating peers. However, the spec (https://github.com/hyperledger/fabric/blob/master/docs/protocol-spec.md#222-multiple-validating-peers) says, "Non validating peers (also known as peers) receive user transactions on behalf of users, and after some fundamental validity checks, they forward the transactions to their neighboring validating peers. Peers maintain an up-to-date copy of the blockchain, but in contradiction to validators, they do not execute transactions (a process also known as transaction validation).” If it has an upto data copy of the blockchain, we should be able to query data, right? or my understanding is wrong?

tuand
2016-12-01 15:00
scrum ...

2016-12-01 15:00
@tuand has started a Google+ Hangout for this channel. https://hangouts.google.com/hangouts/_/r6xhuz4wsvfxnarynsnr25zaxqe.

tuand
2016-12-01 16:59
@sanchezl trying to run kafka docker-compose ( from 2459 change set ) on OSX/vagrant ... failing with `ERROR: In file './docker-compose.yml' service 'version' doesn't have any configuration options. All top level keys in your docker-compose.yml must map to a dictionary of configuration options.`

sanchezl
2016-12-01 16:59
you might have an old docker-compose installed

tuand
2016-12-01 16:59
i think it's because my docker-compose version is backlevel ? 1.5.2 , need to be at 1.6 ?

sanchezl
2016-12-01 16:59
Yes

tuand
2016-12-01 17:00
you running inside vagrant ?

sanchezl
2016-12-01 17:00
Yes, but it should work outside also.



tuand
2016-12-01 17:01
i'm running inside vagrant and getting the error .... time to rebuild my image :slightly_smiling_face:

sanchezl
2016-12-01 17:01
Do you have that icon

sanchezl
2016-12-01 17:01
ahh, yes… you’ll need to pick up the updated compose

tuand
2016-12-01 17:02
thx ! onward ...

garisingh
2016-12-01 17:19
@tuan - best to move to compose 1.8

ynamiki
2016-12-02 00:56
@ynamiki has left the channel

nits7sid
2016-12-02 12:24
Does Hyperledger Fabric v0.6 not synchronize block data after adding a new vp peer to the existing network?

garisingh
2016-12-02 12:56
well as mentioned before, v0.6 does not claim to support adding a new validating peer to an existing network. I believe that you had tried to start a 4 peer network configured with N=5 and then start the 5th peer after the first 4. In theory, if you then invoke a bunch of transactions the 5th peer *might* catch up. I don't know that we ever tested / tried that scenario

bcbrock
2016-12-02 14:41
Someone told me the other day that the minimum blob size for ordering was looking like 100KB, up to 1MB. Can anyone confirm/explain/deny this rumor?

garisingh
2016-12-02 14:44
sorry - that was an error on my part in a conversation with kostas . the intent was to talk about the average range for the largest stuff we've seen rather than the average size of transactions overall

bcbrock
2016-12-02 14:46
Thanks Gari, that is reassuring.

bcbrock
2016-12-02 14:48
To date I have only tested the ordering service with synthetic blobs. If anyone has run real workloads with endorsements and signatures I would be very interested to know the range of blob sizes that are being handled.

kostas
2016-12-02 14:51
I would second that, esp. as we're adding cut-by-filesize block logic, and we'll need to have sensible defaults there.

kostas
2016-12-02 14:52
I know that for @jzhang 's demo we had to bump the max. filesize to 10 (15?) MB though. What is that transaction that's so big?

jzhang
2016-12-02 14:57
@kostas it’s the chaincode deploy, not sure exactly how big but it’s more than 1MB (kafka default)

sanchezl
2016-12-02 14:57
Yes, it was a deploy tx , and it was 7.3 MB
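
For reference, the broker-side limit jzhang mentions is Kafka's `message.max.bytes` (default roughly 1 MB); raising it for large deploy transactions also requires raising the replication fetch size. An illustrative broker config fragment (values are examples, not recommendations):

```properties
# Accept larger messages than the ~1MB default (value here is ~15 MB)
message.max.bytes=15728640
# Must be >= message.max.bytes or replicas cannot fetch the large messages
replica.fetch.max.bytes=15728640
```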

jzhang
2016-12-02 14:58
it’s already much smaller than v0.6 because we don’t have to send the whole fabric source tree any longer

kostas
2016-12-02 14:58
And IIRC we're doing it this way (and not just point to a Github repo and build from there, same as you'd do with a Dockerfile for instance) because we want to support private chaincodes?

jzhang
2016-12-02 14:59
even then the deploy could be arbitrarily large if the chaincode has external dependencies that the fabric-ccenv base image doesn’t already have

kostas
2016-12-02 15:00
Interesting. I am not familiar with the mechanics of this at all, but my initial thought would be the same; why aren't all of these (even the dependencies) instructions that the receiving peer would parse, i.e. why isn't a deploy transaction the equivalent of a Dockerfile.

jzhang
2016-12-02 15:00
that’s right, don’t want to make assumption about Peer’s access to a devops service (aka github or any other source code repos)

sanchezl
2016-12-02 15:01
Would it be possible to fragment a tx?

jzhang
2016-12-02 15:02
i remember seeing a slideshare from a linkedin engineer about fragmenting messages on the kafka side

jzhang
2016-12-02 15:02
i can dig that up

jzhang
2016-12-02 15:02
after my scrums call

garisingh
2016-12-02 15:23
I guess there's the fine line of trying to figure out the right defaults for these types of things, but in the end they will likely need to be dynamically configurable

garisingh
2016-12-02 15:25
Now - there's really no need for these massive chaincode deploy archives. we should really start measuring some file sizes there.

muralisr
2016-12-03 15:11
@jyellick @manish-sethi @kostas a Block from orderer is per chain (all the transactions in the block belong to the chain), shouldn’t the blockHeader contain the chainID ?

jyellick
2016-12-03 15:12
Since all blocks contain at least one tx, and each tx contains the chain id, we can get by without one

muralisr
2016-12-03 15:12
sure

muralisr
2016-12-03 15:12
just checking

muralisr
2016-12-03 15:12
otherwise we’ll have to invent another layer on top of block...

jyellick
2016-12-03 15:13
I'd entertain adding it, but don't have a compelling reason off the top of my head

kostas
2016-12-03 15:13
Yeah, do you have a specific use-case Murali?

muralisr
2016-12-03 15:14
I’m basically removing the “DefaultChain” from chaincode framework… its a peg everything hangs on today. We need to remove it to make way for multichain

muralisr
2016-12-03 15:15
so i’m now in the noopscommitter client which was hardcoding DefaultChain

muralisr
2016-12-03 15:15
I want to use the chainid in the block

muralisr
2016-12-03 15:16
I can get by with the chainide from the envelope of a TX

muralisr
2016-12-03 15:16
of each TX

muralisr
2016-12-03 15:16
but what if there’s no TX in the block ?

kostas
2016-12-03 15:16
When would that be the case?

muralisr
2016-12-03 15:17
probably invalid… but it highlights the point

jyellick
2016-12-03 15:17
The orderer will never send you a block without a tx, or I can't think of a way this would be valid

muralisr
2016-12-03 15:17
the tx id is block wide but is embedded in each tx

muralisr
2016-12-03 15:17
ok

kostas
2016-12-03 15:18
I guess in a BFT scenario you _could_ have an orderer that sends you a bad block, but you should be able to discard this right away.

muralisr
2016-12-03 15:18
sounds good

kostas
2016-12-03 15:18
(Nothing will check out, no f+1 sigs, etc.)

muralisr
2016-12-03 15:19
thanks much! just checking...

garisingh
2016-12-03 21:32
hey folks - please check out https://jira.hyperledger.org/browse/FAB-1255 Given some of the oddities of how we are going to have to deal with identity and trust certificates and also that we want to make TLS a first class citizen, I thought it would make sense to create a standard "secure GRPC server". Just started on it but should have basic functions working in a day or so. I also want to add in some interceptors and metrics as well If there are specific setting that you need to pass in (e.g. max message size, etc), please add them to the JIRA. Other option is to say this is a stupid idea :wink:

muralisr
2016-12-03 21:36
@garisingh before we do that :wink: https://gerrit.hyperledger.org/r/#/c/2961/

muralisr
2016-12-03 21:36
looking at 1255 now

muralisr
2016-12-03 21:42
I see what you are saying

muralisr
2016-12-03 21:45
I think that’s a sound idea… let me get it straight… we have a few servers (event, peer, orderer) and GRPC services hanging off them. The servers have some basic properties we could standardize upon (especially TLS handling). Just like we have comm which handles connections with TLS etc. across the board, we could do with a Server package ?

garisingh
2016-12-03 21:49
correct - so comm would have server and client (I'm thinking to create a new client rather than modifying the current connection stuff)

garisingh
2016-12-03 21:50
```
func NewSecureGRPCServer(config *ServerConfig) (*grpc.Server, net.Listener, error) {}

// Configuration information for a GRPC server
type ServerConfig struct {
	// Listen address for the server specified as hostname:port
	Address string
	// Certificate presented by the server for TLS communication
	ServerCertificatePEM []byte
	// Key used by the server for TLS communication
	ServerKeyPEM []byte
	// List of certificate authorities to be used to authenticate clients
	// if client authentication is required
	ClientRootPEM [][]byte
}
```

garisingh
2016-12-03 21:51
and of course additional properties TBD will be added to ServerConfig
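For reference, a minimal sketch of how such a constructor might assemble the TLS side from the PEM fields above, using only the Go standard library. `newTLSConfig` is a hypothetical helper, not the actual FAB-1255 implementation; the resulting `*tls.Config` is what would typically be handed to grpc-go via `grpc.Creds(credentials.NewTLS(...))`.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
)

// ServerConfig mirrors the struct sketched above.
type ServerConfig struct {
	Address              string
	ServerCertificatePEM []byte
	ServerKeyPEM         []byte
	ClientRootPEM        [][]byte
}

// newTLSConfig (hypothetical) builds a *tls.Config from the PEM material.
// Client authentication is enabled only if client roots are supplied.
func newTLSConfig(cfg *ServerConfig) (*tls.Config, error) {
	cert, err := tls.X509KeyPair(cfg.ServerCertificatePEM, cfg.ServerKeyPEM)
	if err != nil {
		return nil, err
	}
	tlsCfg := &tls.Config{Certificates: []tls.Certificate{cert}}
	if len(cfg.ClientRootPEM) > 0 {
		pool := x509.NewCertPool()
		for _, root := range cfg.ClientRootPEM {
			if !pool.AppendCertsFromPEM(root) {
				return nil, errors.New("bad client root certificate")
			}
		}
		tlsCfg.ClientCAs = pool
		tlsCfg.ClientAuth = tls.RequireAndVerifyClientCert
	}
	return tlsCfg, nil
}

func main() {
	// With no key material, tls.X509KeyPair fails fast.
	_, err := newTLSConfig(&ServerConfig{Address: "localhost:7050"})
	fmt.Println(err != nil)
}
```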

muralisr
2016-12-03 22:59
sounds fine

muralisr
2016-12-03 23:01
from a chaincode pov (client) the only thing is that it needs a msg size. I’ve seen that for large messages between chaincode and peer, the GRPC transport may not handle them … not sure what the limits are here but that would be one thing I think

tuand
2016-12-05 15:00
scrum ...

2016-12-05 15:00
@tuand has started a Google+ Hangout for this channel. https://hangouts.google.com/hangouts/_/w4fwecg6ardwlnagyjbfklgpu4e.

tuand
2016-12-05 15:24
working on FAB-665, FAB-666 bootstrapping of orderer - reviewing one last time changeset http://gerrit.hyperledger.org/r/2963 and then will update the genesis block create tool to create the blocks as defined

tuand
2016-12-05 15:28
also working on FAB-701 authenticating connection btw orderer and kafka network ... TLS setup will be via .properties files but we will need to map forward configuration from orderer to sarama config struct ... one outstanding issue is how to add/remove orderers as I haven't seen if kafka can dynamically revoke a certificate yet

hgabor
2016-12-05 16:40
@tuand @vukolic @jyellick about SBFT, what I will have to do: 1) finish networked stress tests with Marko (we still need to fix a bug) 2) remove network stress tests and sbft tests from orderer's "short test set" (e.g. if !Short() return : https://golang.org/pkg/testing/#Short) 3) rewrite sbft and app to use the common stuff

vukolic
2016-12-05 16:42
re 3) what sort of sbft rewrite this involves

vukolic
2016-12-05 16:42
?

hgabor
2016-12-05 16:43
about 1) we agreed that until censorship prevention component is implemented we simply add delays (sleep N seconds) to the tests to give nodes enough time to connect (as there were cases when the replicas were unable to connect to each other quickly enough and some requests got 'censored'/left out) - are you OK with this?

hgabor
2016-12-05 16:44
@vukolic 3) involves the usage of the common stuff @jyellick introduced (e.g. common proto structures, block cutter, common deliver/broadcast etc lot of stuff)

hgabor
2016-12-05 16:44
@jyellick help me out with some details please :smile:

vukolic
2016-12-05 16:45
ok let's do that carefully :slightly_smiling_face:

vukolic
2016-12-05 16:45
in principle it is not clear that every consensus impl needs to do stuff in the same way

vukolic
2016-12-05 16:46
imagine 3rd party consensus

jyellick
2016-12-05 16:46
My suggestion would be first, to migrate from the `batch` structure used internally to the common `Block` structure, they are nearly the same, and ultimately, sbft will need to deliver `Block`s

vukolic
2016-12-05 16:47
why is this not called Batch?

jyellick
2016-12-05 16:47
Because it contains a hash chain

jyellick
2016-12-05 16:47
Or rather, because it forms one

vukolic
2016-12-05 16:47
well Batches do as well

vukolic
2016-12-05 16:47
calling it Batch would be in spirit of NCAP

vukolic
2016-12-05 16:47
Block is for Validated Ledger

jyellick
2016-12-05 16:48
But for implementation, if we wish to have re-use of components, we must have a common data type

vukolic
2016-12-05 16:48
agree - I am talking about the name :slightly_smiling_face:

jyellick
2016-12-05 16:49
We cannot convert from a `Batch` datatype to a `Block` datatype without breaking the hash chain (unless they are 100% identical, which at that point, having two different data structures seems silly)

vukolic
2016-12-05 16:49
let me rephrase

vukolic
2016-12-05 16:49
I propose to refactor Block into Batch

vukolic
2016-12-05 16:49
rename

vukolic
2016-12-05 16:49
as for current Batch and Block compatibility I will take a look

jyellick
2016-12-05 16:50
But this is all an internal implementation detail? If you wish, we can `type batch cb.Block`. I think I must be missing your point

vukolic
2016-12-05 16:51
what I am saying is that the thingy should be *called* Batch

vukolic
2016-12-05 16:51
not Block

vukolic
2016-12-05 16:51
to be in spirit of the NCAP

jyellick
2016-12-05 16:52
But then we will be using a `Batch` datastructure to store blocks. And the only benefit is for people who actually inspect the code (what we call what a user thinks of as a 'batch' internally should have no impact on that user).

vukolic
2016-12-05 16:53
no - we use it to store batches :slightly_smiling_face:

vukolic
2016-12-05 16:53
raw ledger has batches

vukolic
2016-12-05 16:53
validated ledger has blocks

vukolic
2016-12-05 16:53
I can also do the renaming in the NCAP

vukolic
2016-12-05 16:53
but it may be rooted already in people's brains this way

vukolic
2016-12-05 16:53
it is in mine :slightly_smiling_face:

jyellick
2016-12-05 16:53
I guess I would argue that that's a clarification we made to make things clear in the NCAP to the high level user, but in real implementation, we have blocks on both sides. It was just confusing to a user to call them both blocks.

jyellick
2016-12-05 16:54
Fundamentally, at the orderer side, we _are_ building a blockchain.

jyellick
2016-12-05 16:54
All transactions in the blockchain will be valid, per the blockchain rules.

vukolic
2016-12-05 16:54
anyway - naming needs to be in sync

vukolic
2016-12-05 16:54
one way or another

jyellick
2016-12-05 16:54
It just so happens that the peer side is going to take this perfectly valid orderer blockchain, and apply a different set of rules to it, to form another, perfectly valid blockchain.

vukolic
2016-12-05 16:54
either renaming code

vukolic
2016-12-05 16:54
or renaming ncap

jyellick
2016-12-05 16:55
If we rename it in code... then the naming on the blockchain side is right, and then on the peer side it is wrong.

kostas
2016-12-05 16:55
https://gerrit.hyperledger.org/r/#/c/697/ (discussion starting from 08-26 16:01 is a deja-vu of this)

vukolic
2016-12-05 16:56
no its not - in NCAP this is Batch on both sides

vukolic
2016-12-05 16:56
anyway

vukolic
2016-12-05 16:56
we are wasting too much slack for this

vukolic
2016-12-05 17:10
Another option is to rename NCAP to Blocks and VBlocks

vukolic
2016-12-05 17:10
I do not really care

vukolic
2016-12-05 17:10
Except that naming must be consistent

hgabor
2016-12-05 17:13
yeah the code and the document should have the same names, that is true

dave.enyeart
2016-12-05 17:30
@vukolic v1 won’t even have validated ledger, the committer will have a raw ledger that includes an indicator of which trans are valid or not. I don’t think we’d want to have a v1 without blocks :slightly_smiling_face: , therefore I’d agree NCAP should be updated to call them Blocks and VBlocks.

klorenz
2016-12-05 20:00
has joined #fabric-consensus-dev

kostas
2016-12-05 23:30
A heads up that I'll be changing the test chain ID string (currently set to `**TEST_CHAIN_ID**`) as asterisks are not allowed in topic names in Kafka: https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/common/Topic.scala#L29

kostas
2016-12-05 23:31
This has implications on the allowed chain IDs in general.

kostas
2016-12-05 23:32
Do we provide a function that generates acceptable allowed chain IDs and check the given chain ID during the config-tx filtering in ingress?

muralisr
2016-12-05 23:33
gee thanks :slightly_smiling_face:

muralisr
2016-12-05 23:33
just kidding

kostas
2016-12-05 23:33
Do we let all characters be fairplay, then strip out the invalid ones if the backing consenter is Kafka?

muralisr
2016-12-05 23:33
can you do me a favor ? ….also change in core/util ?

kostas
2016-12-05 23:34
(@muralisr: Sure, will do.)

muralisr
2016-12-05 23:34
that should take care of CLI e-2-e

muralisr
2016-12-05 23:34
thanks, Kostas

kostas
2016-12-05 23:35
The downside of the first option is that we're tied too much to the way Kafka does things. It also won't play nicely if we want the chain ID to be the hash of certain data as @elli had suggested at one point.

kostas
2016-12-05 23:36
The downside of the second option is that we're losing the one-to-one mapping between the original chain ID and the one that Kafka holds. May be insignificant (and nothing that can't be taken care of with persisting some metadata), but doesn't quite feel right.

muralisr
2016-12-05 23:36
what do you think of keeping it simple and doing “standard” chars for chainID across the board ?

kostas
2016-12-05 23:37
So that's closer to the first option right?

muralisr
2016-12-05 23:37
`Do we provide a function that generates acceptable allowed chain IDs and check the given chain ID during the config-tx filtering in ingress?` - basically saying perhaps we enforce a “valid chain ID” at fabric level ?

kostas
2016-12-05 23:38
Yes.

muralisr
2016-12-05 23:38
I think if we do that, we at least standardize upfront and keep it sane/simple

kostas
2016-12-05 23:38
To me that works. I do remember reading in a slide that we wanted the chain ID to be the hash digest of certain data (Elli's suggestion, or I may be wrong), but I don't know if there's a special reason behind this. If there's not, we're good.

muralisr
2016-12-05 23:39
right, I don’t know all the implications of that… however if we allow arbitrary bytes, we run into the problem of needing to map to various underlying technologies

kostas
2016-12-05 23:39
We could always do what I used to do when chainID was a byte slice.

kostas
2016-12-05 23:39
I would hex-encode it. But the length limit still applies. (249 characters max.)

kostas
2016-12-05 23:40
All safe characters :simple_smile:
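The constraint from the linked Topic.scala and the hex-encoding fallback discussed above can be sketched as follows. The function names are illustrative only; the character class `[a-zA-Z0-9._-]` and the 249-character cap come from the Kafka source linked earlier.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"regexp"
)

// legalChars reflects Kafka's topic-name rule: only [a-zA-Z0-9._-].
var legalChars = regexp.MustCompile(`^[a-zA-Z0-9._-]+$`)

// isValidKafkaTopic (illustrative) checks a candidate chain ID against
// Kafka's topic-name restrictions: legal characters, max 249 chars.
func isValidKafkaTopic(name string) bool {
	return len(name) > 0 && len(name) <= 249 && legalChars.MatchString(name)
}

// hexEncodeChainID shows the fallback mentioned above: hex-encoding an
// arbitrary chain ID always yields Kafka-safe characters, at the cost
// of halving the usable length.
func hexEncodeChainID(chainID []byte) string {
	return hex.EncodeToString(chainID)
}

func main() {
	fmt.Println(isValidKafkaTopic("**TEST_CHAIN_ID**"))                          // asterisks are illegal
	fmt.Println(isValidKafkaTopic(hexEncodeChainID([]byte("**TEST_CHAIN_ID**")))) // hex chars are safe
}
```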

muralisr
2016-12-05 23:40
right. worst case we could do that

kostas
2016-12-05 23:40
In fact, I'll do that for now, and wait until we reach a —wait for it— consensus before I change the test chain ID.

muralisr
2016-12-05 23:40
I like it… let the mapping be handled at the higher levels , in-your-face, rather than burying deep

muralisr
2016-12-05 23:41
for the test chainID , see no harm in moving to sane chars

muralisr
2016-12-05 23:41
at least will unblock you

kostas
2016-12-05 23:42
I'm good either way. At least now that I know the issue, I can work around it. Thanks Murali.

muralisr
2016-12-05 23:43
thank you!

elli
2016-12-06 08:04
Hi, @kostas, @muralisr; correct, there was a thought of having the chain ID be the hash of the genesis config content (assuming that it is equipped with a nonce or sequence number of requested chains per participant). This just had the advantage that it could be pre-computed by the application that submits the config block, which could henceforth bind a chain to its configuration data. Proper encoding would avoid the use of non-acceptable chars.

elli
2016-12-06 08:05
But it is true that it is also not the friendliest of identifiers. So a user-friendly name may nevertheless be needed :slightly_smiling_face: Up to you.

adc
2016-12-06 08:26
Hi @tuand, what happened with https://gerrit.hyperledger.org/r/#/c/2605/? Are you still working on it?

tuand
2016-12-06 13:47
hi @adc, i'm redoing the genesis block create tool and 2605 as well ... anything you need me to do there ?

muralisr
2016-12-06 14:03
@elli chain ID is going to be in every proposal, every transaction… would the length be any concern ?

elli
2016-12-06 14:12
Hi @muralisr: hm, hash output can be 128/256 bits. Do you consider that be too long?

muralisr
2016-12-06 14:12
not sure… just asking, that’s all

elli
2016-12-06 14:12
@tuand, just pushed a cleaner version of the protos/golang schemas for config in a new changeset. https://gerrit.hyperledger.org/r/#/c/3015/

elli
2016-12-06 14:13
Just fyi :slightly_smiling_face:

muralisr
2016-12-06 14:13
I wouldn’t say it’s long at all but it does depend upon the average size

tuand
2016-12-06 14:13
thanks elli !

tuand
2016-12-06 14:30
@elli @adc @jyellick do you want to put ChainInitConfig inside one configurationItem ? or have separate configurationItems for MSPManager, gossipAnchors etc ... ? probably easier to treat it as one big configurationItem ?

jyellick
2016-12-06 14:30
I thought we had decided on individual config items per MSP

elli
2016-12-06 14:31
Ok, so it would be

elli
2016-12-06 14:31
- one config item per MSP

elli
2016-12-06 14:31
(including the entire config of that msp: what is described in the schema as MSPConfig)

elli
2016-12-06 14:32
- one config item for chain configuration including: readers, writers, admins, and available MSPs?

jyellick
2016-12-06 14:32
That seems too coarse to me

elli
2016-12-06 14:32
But for the beginning it may be fine

elli
2016-12-06 14:32
since we only have one admin

elli
2016-12-06 14:33
or one admin list

jyellick
2016-12-06 14:33
I was about to comment on the changeset, but I think we should take this much more incrementally

jyellick
2016-12-06 14:33
Let's start simply by defining the MSPs, and a policy for who is allowed to `Broadcast`, I assume this is equivalent to writers?

elli
2016-12-06 14:34
Correct.

elli
2016-12-06 14:34
But why would we need to define a policy if this is a hardcoded one?

jyellick
2016-12-06 14:34
I still don't understand what it means for the policy to be hardcoded

elli
2016-12-06 14:34
That is if we say whenever one is to check if X has permission to broadcast check that list.

jyellick
2016-12-06 14:35
We can have the tool automatically generate this policy, but if we don't have to define it statically in the code, I don't think we should

elli
2016-12-06 14:35
So if you assume that you have a policy t out of N identities do X

elli
2016-12-06 14:35
then there is a part of the policy that can be hardcoded, and that is that it is always 1 out of N

elli
2016-12-06 14:35
(for example)

elli
2016-12-06 14:36
or always N out of N

elli
2016-12-06 14:36
(AND case)

elli
2016-12-06 14:36
then the part of the policy that you can modify is only the list of certificates

elli
2016-12-06 14:36
(that would implicitly define N)

jyellick
2016-12-06 14:38
But you always have to specify who the `N` are. We don't have to expose the fact that the tool also generates a `1 out of` or an `N out of`, but in implementation, it's much easier to have a single case than two.
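The "t out of N" discussion above reduces to a single evaluator, where OR is t=1 and AND is t=N, which is why a single implementation case suffices. A minimal sketch (names are illustrative, not fabric's policy API):

```go
package main

import "fmt"

// evaluate reports whether at least t of the identities produced a
// valid signature. OR is t=1, AND is t=len(valid); one implementation
// covers both special cases mentioned above.
func evaluate(t int, valid []bool) bool {
	count := 0
	for _, ok := range valid {
		if ok {
			count++
		}
	}
	return count >= t
}

func main() {
	sigs := []bool{true, false, true} // one of three signatures is invalid
	fmt.Println(evaluate(1, sigs))    // 1-of-N (OR)
	fmt.Println(evaluate(3, sigs))    // N-of-N (AND)
}
```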

elli
2016-12-06 14:39
Correct

elli
2016-12-06 14:39
And that is part of the MSP config.

elli
2016-12-06 14:39
(under rootCAs)

adc
2016-12-06 14:39
Hi @tuand, I was just interested on the status of the change-set because it is a good starting point to include the MSP as soon as it is ready.

tuand
2016-12-06 14:40
cool @adc ! I shall ping you in about 8 hours with new news :slightly_smiling_face:

jyellick
2016-12-06 14:41
I just feel like we are trying to tackle too much here. What is the minimum set that is required to instantiate an MSP (manager?) and validate a signature against that MSP? We can even ignore policies for the moment.

adc
2016-12-06 14:41
great :slightly_smiling_face:

elli
2016-12-06 14:42
Correct. But RootCAs are needed for the validation of certificates. @aso is already working on that.

elli
2016-12-06 14:43
For the admin case/reconfiguration agreed,that this can wait.

aso
2016-12-06 14:43
@jyellick yeah, I'll try to have a change-set for us to stare at later today

adc
2016-12-06 14:45
@jyellick we are already going in the direction of minimalism :slightly_smiling_face:

elli
2016-12-06 14:46
+1

jyellick
2016-12-06 14:46
@aso Can you point me to the data structure you will need the orderer to feed you? There is https://gerrit.hyperledger.org/r/#/c/3015/, but it contains a lot of structures, is it `MSPDesc`?

jyellick
2016-12-06 14:47
And is there a good sample which provides real certs etc. that we can use in unit tests?


aso
2016-12-06 14:49
but @elli should review them once more

tuand
2016-12-06 14:49
so ignoring policies for a bit , for chainInitConfig, i think then we want separate configurationItems for MSPManager(including readers/writers/admin) , OrderingClientConfig, OrderingServerConfig, GossipAnchors ?

aso
2016-12-06 14:50
and yes, in the change set I'll produce a sample config file that will allow a proper local MSP to be created

aso
2016-12-06 14:50
I have half a mind to also write a small sample program to generate those json files from a cert, a CA cert and a keypair

aso
2016-12-06 14:51
that should be helpful for @tuand I guess

jyellick
2016-12-06 14:54
@elli I see there is the notion of an `MSPGroup`, is this what we are settling on for our first class identity citizen? So policies should be written against `MSPGroup`s (supplied as the identity bytes to the signature validation)?

elli
2016-12-06 14:55
@aso: i already have test files for that.

elli
2016-12-06 14:55
that is code, that can be used on that end. I just removed them from the changeset to avoid confusion.

aso
2016-12-06 14:56
awesome! That'll save me some time! :slightly_smiling_face: Do they already read certs/keys files in pem format from command line?

adc
2016-12-06 14:58
if parsing methods are needed, the crypto/primitives packages has many :slightly_smiling_face:

elli
2016-12-06 14:59
No, they assume you have some string version of the certs, and create a sample config file with these...

aso
2016-12-06 14:59
super

aso
2016-12-06 14:59
ah ok, so I'll write that part then

aso
2016-12-06 14:59
shouldn't be too much work anyway

jyellick
2016-12-06 14:59
@aso You said you've taken https://gerrit.hyperledger.org/r/#/c/3015/1/config-schemas/chain-genesis-config-schema.go and made some changes, but surely you don't need all of those structures, only some subset of them?

elli
2016-12-06 14:59
Instantiates the objects and put them together into that json config.

elli
2016-12-06 15:00
Hm, it is actually this one for the local setup


elli
2016-12-06 15:01
Used for orderers and/or peers. Indeed a small subset of the previous one.

aso
2016-12-06 15:01
@jyellick this is still WIP, but the peer config schema looks a bit like this
```
type PeerLocalConfig struct {
	LocalMSP *MSPManagerConfig `json:"msp-config"`
	BCCSP    *BCCSPConfig      `json:"bccsp-config"`
}

type MSPManagerConfig struct {
	Name    string       `json:"name"`
	MspList []*MSPConfig `json:"msps"`
}

type MSPConfig struct {
	Type   ProviderType `json:"type"`
	Config []byte       `json:"config"`
}

type FabricMSPConfig struct {
	Name            string               `json:"id"`
	RootCerts       [][]byte             `json:"rootcas"`
	Admins          [][]byte             `json:"admins"`
	RevocationList  [][]byte             `json:"revoked-ids,omitempty"`
	SigningIdentity *SigningIdentityInfo `json:"signer,omitempty"`
}

type SigningIdentityInfo struct {
	PublicSigner  []byte   `json:"pub"`
	PrivateSigner *KeyInfo `json:"priv"`
}

type KeyInfo struct {
	KeyIdentifier string `json:"key-id"`
	KeyMaterial   []byte `json:"key-mat"`
}

type BCCSPConfig struct {
	Name     string `json:"name"`
	Location string `json:"location"`
}
```

aso
2016-12-06 15:02
(I've removed comments for the sake of brevity, they are of course still in the code)

elli
2016-12-06 15:02
Actually not

elli
2016-12-06 15:02
the signingidentity is part of the MSPConfig

elli
2016-12-06 15:02
actually not

elli
2016-12-06 15:03
it is not the latest, and FabricMSPConfig does not have any signing identity

jyellick
2016-12-06 15:03
The `MSPManager` or whatever bit of code that will do the signature validation has what signature for its constructor? Or how do these structures get put into the MSP?

aso
2016-12-06 15:04
@jyellick @elli please note that this is WIP and I'll add you all as reviewers and we can and will change things; I just went with this schema for now *only* in order to have some code out

aso
2016-12-06 15:05
> The `MSPManager` or whatever bit of code that will do the signature validation has what signature for its constructor? Or how do these structures get put into the MSP? not sure I follow...

jyellick
2016-12-06 15:06
@aso I see a lot of structure definitions. But for example, say, given a slice of `MSPConfig`, I take it, and invoke `NewMSPManager(configSlice)` and I get back an instance of `MSPManager` which I can use to validate signatures.

jyellick
2016-12-06 15:06
(Obviously this is a hypothetical flow, but I'm looking for what the real one is)
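The hypothetical flow above can be sketched as follows. `MSPConfig`, `MSPManager`, and `NewMSPManager` here are illustrative stand-ins, not the actual fabric types under review; only the constructor shape (slice of configs in, validating manager out) is the point.

```go
package main

import (
	"errors"
	"fmt"
)

// MSPConfig is an illustrative stand-in for the real config structure.
type MSPConfig struct {
	Name   string
	Config []byte
}

// MSPManager indexes the configured MSPs; a real one would also expose
// signature-validation methods.
type MSPManager struct {
	msps map[string]MSPConfig
}

// NewMSPManager mirrors the hypothetical flow above: a slice of configs
// in, a manager out that can later be asked to validate signatures.
func NewMSPManager(configs []MSPConfig) (*MSPManager, error) {
	m := &MSPManager{msps: make(map[string]MSPConfig)}
	for _, c := range configs {
		if c.Name == "" {
			return nil, errors.New("MSP config missing name")
		}
		m.msps[c.Name] = c
	}
	return m, nil
}

func main() {
	m, err := NewMSPManager([]MSPConfig{{Name: "DEFAULT"}})
	fmt.Println(err == nil, len(m.msps))
}
```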

muralisr
2016-12-06 15:07
(no hurry @aso … don’t want to interrupt the flow… but when you get to it `RevocationList [][]byte` in `FabricMSPConfig` … that’s not a static one time thing is it ? if CRLs are going to be sent out periodically do we need a separate “update” structure for that ?)

aso
2016-12-06 15:09
@jyellick yes, the code comes with some factories for MSP managers

aso
2016-12-06 15:09
there is a singleton for the "local" msp, and then a factory to get a manager for a given chain

aso
2016-12-06 15:10
@elli @jyellick we should work together to define the exact arguments to those factories

jyellick
2016-12-06 15:10
We should be able to do signature validation without the 'local' msp, no?

elli
2016-12-06 15:11
nope

aso
2016-12-06 15:11
that's tricky.. the local msp acts on static information that never changes (its config comes from a config file after all)

aso
2016-12-06 15:11
so I think the local msp should either never be used to validate signatures, or only be used at chain creation/join time if needed (but @elli yesterday told me not even this is required)

jyellick
2016-12-06 15:13
I seem to recall that it was expressed on a call, but the local MSP and 'chain MSP's (for lack of a better term) seem like very different beasts

adc
2016-12-06 15:13
@aso, I also need the factories at the orderers :slightly_smiling_face:

aso
2016-12-06 15:17
> I seem to recall that it was expressed on a call, but the local MSP and 'chain MSP's (for lack of a better term) seem like very different beasts the only difference I understand is that the local MSP can dispense signing identities whereas chain MSPs can't and won't

jyellick
2016-12-06 15:17
My assumption is that we instantiate an MSP manager per chain. Then, on chain reconfiguration (or genesis) we're going to either feed a set of 'updated MSP definitions' to the manager, or, we simply instantiate a new one by re-invoking the constructor with this set. Does this sound reasonable? Or am I missing something?

aso
2016-12-06 15:17
but one can very well use the same interfaces and config schema, using `omitempty`

aso
2016-12-06 15:18
> My assumption is that we instantiate an MSP manager per chain. Then, on chain reconfiguration (or genesis) we're going to either feed a set of 'updated MSP definitions' to the manager, or, we simply instantiate a new one by re-invoking the constructor with this set. Does this sound reasonable? Or am I missing something? It sounds very reasonable

aso
2016-12-06 15:18
the manager *for a chain* is created once, setup once and refreshed any number of times

elli
2016-12-06 15:19
@jyellick: localMSP or signerMSP is indeed only instantiate to offer the peer signing abilities. (we discussed this in the call)

elli
2016-12-06 15:19
MSP description in the chain had more the meaning of a verifierMSP used to verify signatures coming from tx/proposal creators.

elli
2016-12-06 15:20
But in reality, as @aso, and @tuand pointed out in the respective changeset one could have a configuration structure that includes all fields

elli
2016-12-06 15:20
that is, signingMSP = verifierMSP config + signing identity

jyellick
2016-12-06 15:21
@aso Okay. So, for other pieces of config, we already have a `configtx.Manager` which will essentially call `Begin`, then for as many times as there are config items of a given type call `Propose(item) error` and eventually call `Rollback` or `Commit`. So, I imagine we have a set of `MSPConfig`s, which we pass through this process to handle the manager updates.
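The Begin/Propose/Rollback/Commit contract described above can be sketched as a toy transactional handler. This is an illustrative shape for the flow `configtx.Manager` drives, not fabric's actual handler interface; the method names follow the message above.

```go
package main

import (
	"errors"
	"fmt"
)

// mspHandler stages proposed config items and applies them atomically,
// mirroring the Begin/Propose/Rollback/Commit flow described above.
type mspHandler struct {
	committed map[string][]byte // active config
	pending   map[string][]byte // staged between Begin and Commit
}

func (h *mspHandler) Begin() { h.pending = make(map[string][]byte) }

// Propose validates and stages one config item of this handler's type.
func (h *mspHandler) Propose(key string, value []byte) error {
	if len(value) == 0 {
		return errors.New("empty MSP config rejected")
	}
	h.pending[key] = value
	return nil
}

func (h *mspHandler) Rollback() { h.pending = nil }

func (h *mspHandler) Commit() {
	h.committed = h.pending
	h.pending = nil
}

func main() {
	h := &mspHandler{}
	h.Begin()
	_ = h.Propose("ORG1MSP", []byte("cfg"))
	h.Commit()
	fmt.Println(len(h.committed))
}
```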

elli
2016-12-06 15:21
and for verifiers have an empty signing identity.

aso
2016-12-06 15:23
> @aso Okay. So, for other pieces of config, we already have a `configtx.Manager` which will essentially call `Begin`, then for as many times as there are config items of a given type call `Propose(item) error` and eventually call `Rollback` or `Commit`. So, I imagine we have a set of `MSPConfig`s, which we pass through this process to handle the manager updates. This aspect isn't 100% clear to me yet. I discussed it with @elli the other day and iirc, we could have 2 types of `MSPConfig`s items that you could pass to mspmanager.refresh(): - a whole new config that wipes out the old one - an item that just adds a cert to a revocation list

aso
2016-12-06 15:23
it may be more complex than this though, and @elli surely knows more

elli
2016-12-06 15:24
actually, i do not think we discussed mspmanager config

elli
2016-12-06 15:25
We discussed individual msp config.

jyellick
2016-12-06 15:25
Per the structures, I'm not seeing why we really need an MSP manager config. I think all I see there is a name? And I'm not sure why it needs one?

elli
2016-12-06 15:25
aha, you only need an msplist indeed.

elli
2016-12-06 15:26
this is just to identify config info associated with your msp-manager, that could later be extended with a separate admin.

elli
2016-12-06 15:26
But now you are right. We could have only the msplist...

elli
2016-12-06 15:27
But then, for each msp in the list, it is the internals of that MSP that know how to handle/manage/evaluate reconfiguration requests.

jyellick
2016-12-06 15:27
I would like that very much because it would simplify things significantly for the implementation

elli
2016-12-06 15:27
+1

tuand
2016-12-06 15:30
so we should have configurationItems for msplist, ordererClientConfig, ordererServerConfig, gossipAnchorList, readers, writers, admins ?

jyellick
2016-12-06 15:33
@tuand Possibly? I'm not ready to call it until we actually have concrete implementations for all of it. For instance, there is config already defined in our static bootstrapper stuff which is not enumerated in 3015. I think we need to take it one item at a time

tuand
2016-12-06 15:35
ok, I'll start with a genesis block that contains just one configurationItem for the msplist and we can go from there

jyellick
2016-12-06 15:35
Well, the MSP list should have one config item per MSP

tuand
2016-12-06 15:35
then @adc and @aso can connect their code

jyellick
2016-12-06 15:35
And implicitly, all the MSPs defined would form the list

tuand
2016-12-06 15:36
sure ... so genesis block will have n configurationItems, one per MSPDesc

jyellick
2016-12-06 15:36
Right

jyellick
2016-12-06 15:37
I can quick push a changeset which steals the `MSPDesc` proto message that we can both base our work off of

tuand
2016-12-06 15:37
configItem.Value = marshalled MSPDesc, key=? type=?

jyellick
2016-12-06 15:38
key is a good question, I'd assume something ripped out of the MSPDesc, like the org name? I'll define a new Type called `MSP` we can use

aso
2016-12-06 15:39
@tuand which schema will you use to represent the MSP config?

tuand
2016-12-06 15:40
@tuand uploaded a file: https://hyperledgerproject.slack.com/files/tuand/F3BAKGJTF/-.php and commented: @aso MSPDesc as defined in 3015

tuand
2016-12-06 15:41
@jyellick key = MSPIdentifier ? agree with type=MSP

jyellick
2016-12-06 15:42
@aso or @elli can probably speak to this better, but I'd think `MSPIdentifier` might be a little human unfriendly?

aso
2016-12-06 15:43
got it. 2 Qs for @tuand : - why protobuf and not json? - is it okay if the schema undergoes a few minor changes in the change-set I'm about to push? I've had to make a few changes to have a simpler implementation of the managers/msps (but of course everything can be changed again in the review phase)

jyellick
2016-12-06 15:44
@aso I'm probably to blame for the protobuf push. Essentially because you can marshal protobuf to/from JSON if you want, and we already use protobuf for everything else.

tuand
2016-12-06 15:44
json should be only for genesis block create, internal code should standardize on protobuf

tuand
2016-12-06 15:45
just let me know if you make a patch

aso
2016-12-06 15:45
but anyway changes that affect the MSPs are always generated by some external entity (e.g. a CA advertising a revoked cert)

aso
2016-12-06 15:45
and so as far as the core/config code is concerned, you'll have a byte array with a key name

aso
2016-12-06 15:45
how it's marshalled shouldn't really matter

aso
2016-12-06 15:46
or am I missing something?

jyellick
2016-12-06 15:48
> - is it okay if the schema undergoes a few minor changes in the change-set I'm about to push? I've had to make a few changes to have a simpler implementation of the managers/msps (but of course everything can be changed again in the review phase) Maybe you can keep @tuand in the loop on this? Assuming things don't change too radically I wouldn't think this should cause big problems though

aso
2016-12-06 15:48
of course, I'll add him as a reviewer

aso
2016-12-06 15:49
I'm running the last few tests and will push asap

tuand
2016-12-06 15:49
that's your first mistake alex :wink:

aso
2016-12-06 15:49
:smile:

aso
2016-12-06 15:50
so.. about json vs protobufs: can we agree that MSP can choose whatever encoding schema/marshalling it wants for its config? This should be largely transparent as far as the core is concerned - it only affects COP, MSP and presumably the tool that @tuand is building

tuand
2016-12-06 15:51
the genesis block creation is a manual step , some admin has to create a json file to input to the tool, it's then easier to map to the protobuf and use that everywhere in our code, no need to reparse

aso
2016-12-06 15:51
wait but you already have a json file; why not take the marshalled json as a string and send it around?

aso
2016-12-06 15:52
if not for the whole genesis block, at least for the msp config

aso
2016-12-06 15:52
again, you should treat it as the opaque byte array it is

tuand
2016-12-06 15:55
i can do that ... so the genesis block will have one configuration item per MSPDesc, value = json of MSPDesc as []bytes, type=MSP, key=?

jyellick
2016-12-06 15:56
@aso I don't understand why we are bifurcating the fabric on JSON/protobuf

jyellick
2016-12-06 15:56
Why not use protos internally everywhere?

aso
2016-12-06 15:57
aside from it being elegant in some abstract sense, does it really matter how a []byte is marshalled internally?

aso
2016-12-06 15:59
and btw, json was chosen because it's already used by COP

jyellick
2016-12-06 15:59
I would argue that consistency is a good thing, that a developer who approaches a fabric data structure doesn't have to guess at its encoding. JSON is a good human readable format, which protobuf marshals readily to and from. And every component must already speak protobuf

aso
2016-12-06 16:01
ok, I'll stick with json for now and will change it later (we can create a JIRA item so that we don't forget)

aso
2016-12-06 16:01
is that acceptable also for @tuand ?

jyellick
2016-12-06 16:01
I already see proto definitions for the data structures?

jyellick
2016-12-06 16:02
I assume the interface to the MSP takes golang structures, not JSON?

aso
2016-12-06 16:02
that change set doesn't contain any running code; the one I'm about to push does.. it took the size and shape of an oil-spill and at this point I want it out there asap so that I don't have to rebase a gazillion times :wink:

aso
2016-12-06 16:02
> I assume the interface to the MSP takes golang structures, not JSON?
correct

jyellick
2016-12-06 16:04
Then this seems pretty easy to me, how the MSP manager gets those structures should be pretty irrelevant from an MSP perspective? We can bolt on whatever marshaling scheme seems appropriate (which, I'd suggest is protobuf, since all the other structures in the config are protobuf).

jyellick
2016-12-06 16:04
> so that I don't have to rebase a gazillion times :wink:
Understood, I know how that goes...

tuand
2016-12-06 16:05
@aso @elli send me a sample json file when you're ready

aso
2016-12-06 16:05
I'll send you the schema and the sample file in pvt

jyellick
2016-12-06 16:08
@aso @elli @adc https://gerrit.hyperledger.org/r/#/c/3019/ Here is a shamelessly lifted minimal MSP definition for @tuand and I to work off of until the real one is finalized.

adc
2016-12-06 16:09
that's minimalism actually :slightly_smiling_face:

aso
2016-12-06 16:09
if possible, I'd appreciate it if we could work out of this one for now

aso
2016-12-06 16:09
that would avoid another tiny oil-spill for me :wink:


aso
2016-12-06 16:10
it is neither final nor ideal. It does have a big plus though, which is that it comes with running code :wink:

adc
2016-12-06 16:10
+1

jyellick
2016-12-06 16:11
@aso Works for me, I'll go ahead and update the proto to match, though I'll leave out the signing identity, as this is fixed for genesis/config material

aso
2016-12-06 16:11
if possible leave it in there and do some magic with omitempty

aso
2016-12-06 16:11
this way we have a single definition

aso
2016-12-06 16:11
for all MSPs

aso
2016-12-06 16:11
the local one has a signing identity, the chain one doesn't (it's empty hence omitted)

aso
2016-12-06 16:11
does that work?

jyellick
2016-12-06 16:11
Okay, I can add it in, it just pulls in yet another struct, was trying to keep it small

jyellick
2016-12-06 16:12
But that's fine, whatever is the path of least resistance

aso
2016-12-06 16:12
right, we can remove it later thx!

aso
2016-12-06 16:12
and you get some karma points because you saved me from another rebase :stuck_out_tongue:

jyellick
2016-12-06 16:12
Haha, I'll take them!

jyellick
2016-12-06 16:21
@aso https://gerrit.hyperledger.org/r/#/c/3019/ protos to match your structs, I don't think there are any glaring omissions (also @tuand)

aso
2016-12-06 16:22
super, thanks! yeah, that looks okay

adc
2016-12-06 16:24
even though less minimal :slightly_smiling_face:

adc
2016-12-06 16:26
the principle of minimality in crypto is actually quite challenging to achieve. In a lot of cases, there are components in a crypto scheme that are there just to be able to carry the proof of security. It is more related to our ignorance

jyellick
2016-12-06 16:29
Understood, I'm sure I'm being a little obnoxious with the "let's start small", sorry about that. I just can't wrap my head around the whole thing at once and if we can split things into smaller more easily digestible (but working) pieces, it makes my head hurt a little less.

elli
2016-12-06 16:30
@jyellick this is though only for peer config.

elli
2016-12-06 16:30
For orderer setup one would need to pass the orderer chain config. Adding @tuand, @binhn.

jyellick
2016-12-06 16:32
@elli I'm not sure what you mean? Why is that needed to make the MSP manager work?

tuand
2016-12-06 16:33
I'd say let's start with what we have now and get orderer+msp and peer+msp running ? I can add other config very quickly after that

elli
2016-12-06 16:33
@tuand: +1

elli
2016-12-06 16:36
@jyellick: Well my understanding was that the orderer boots by being provided with some local information (local MSP + key material, + consensus local info), and with the orderer channel genesis config (that includes verifier MSP configuration, readers and writers of the orderer chain, other orderers' identities, etc).

jyellick
2016-12-06 16:37
@elli Understood that this will not be a complete end to end finished configuration. Just trying to add the MSP piece to the chain config, and we can add the remaining config later.

elli
2016-12-06 16:38
ok, got it

elli
2016-12-06 16:39
But this one includes peer local setup only, though: https://gerrit.hyperledger.org/r/#/c/3019/2/protos/common/msp/configuration.proto

elli
2016-12-06 16:39
correct?

jyellick
2016-12-06 16:40
This is supposed to include only the structs that @aso needs in order to stand up a working MSP manager for the MSP's to be embedded in the chain config

jyellick
2016-12-06 16:47
@aso Maybe I missed it, but do you have any static valid `MSPConfig` that we can incorporate for unit tests?

jyellick
2016-12-06 16:48
(And, if you could post your gerrit changeset here once pushed, I'd appreciate it)


aso
2016-12-06 16:49
I've pushed (I wasn't able to verify that all tests work but at least we can start the review process)

aso
2016-12-06 16:49
> do you have any static valid `MSPConfig` that we can incorporate for unit tests?
I do here https://gerrit.hyperledger.org/r/#/c/3025/1/msp/peer-config.json

aso
2016-12-06 16:50
that contains a root CA, a cert for signing and a keypair

jyellick
2016-12-06 16:52
Great, thanks!

aso
2016-12-06 16:52
oh, one thing that is worth pointing out


aso
2016-12-06 16:53
`MSPConfig.Config` is of type `[]byte`

aso
2016-12-06 16:53
this way, different implementation of MSP can use whatever internal format they want

aso
2016-12-06 16:54
so the fabric msp implementation in its setup can do
```
func (msp *bccspmsp) Setup(conf1 *MSPConfig) error {
	// given that it's an msp of type fabric, extract the MSPConfig instance
	var conf FabricMSPConfig
	err := json.Unmarshal(conf1.Config, &conf)
	if err != nil {
		mspLogger.Errorf("Failed unmarshalling fabric msp config, err %s", err)
		return fmt.Errorf("Failed unmarshalling fabric msp config, err %s", err)
	}
```

jyellick
2016-12-06 17:03
Right

jyellick
2016-12-06 17:03
Makes sense

jyellick
2016-12-06 19:33
@jyellick uploaded a file: https://hyperledgerproject.slack.com/files/jyellick/F3BE7JD0U/mutlichain_orderer_diagrams.pdf and commented: Per request of @hgabor here is a rough sketch of how the orderer common components for broadcast/deliver/multichain work.

hgabor
2016-12-06 21:09
@jyellick thanks :slightly_smiling_face:

muralisr
2016-12-06 21:19
thanks @jyellick … timely :slightly_smiling_face:

scottz
2016-12-06 21:55
General API question: Will the ordered transactions/batches be timestamped in v1.0? And is this info for each transaction retrievable by the users? Are there different answers for different consensus algorithms?

jyellick
2016-12-06 22:40
@scottz I think this may be a better question for @garisingh @vukolic @elli @adc @aso, but, we have discussed doing transaction filtering by both time and epoch. Clients should set a timestamp on all transactions, and, when the orderer creates the block, its signature will also be over a `ChainHeader` which includes a timestamp. I think it's yet to be decided exactly what guarantees those timestamps have though, if any.

scottz
2016-12-06 23:09
ok, so it sounds like the capability is there but implementation is yet TBD. One of the features in the plans for R3's CORDA, being developed by/for banking industry, is precise timestamping. I presume they want to be able to query when a given transaction occurred, and be able to tell if it was before or after another transaction. But it is not clear to me what that means to hyperledger/fabric. Is it the timestamp when it was (a) requested by client or SDK, or (b) client SDK receives event notification, or (c) when LeaderPeer receives it, or (d) when Peers determine it is "validated" or committed, or (e) when delivered by orderer service. I guess if we want to support banking industry too, and to write any system/behavior tests, then we may need to get more detail about this timestamping requirement.

muralisr
2016-12-06 23:19
@jyellick any retries at any level in the orderer… on some error (say in consensus) the block gets dropped...

jyellick
2016-12-06 23:29
@muralisr is this a question? In general, once a transaction is ack-ed, it should be "in consensus", the fault tolerance is then determined by the consensus algorithm

muralisr
2016-12-06 23:29
sorry yes, meant to be a question

muralisr
2016-12-06 23:30
to take an example… suppose a batch of txs is handed to kafka for ordering and it returns a failure

muralisr
2016-12-06 23:31
is there a notion of restarting the consensus with that batch, or will that batch of txs be dropped?

jyellick
2016-12-06 23:32
In the Kafka case, and @kostas can correct me if I'm wrong, but we won't ack until the transaction is guaranteed to be ordered

jyellick
2016-12-06 23:34
If something like a configuration transaction invalidates that transaction, then it could be dropped

muralisr
2016-12-06 23:37
understood.. I kinda missed the point. every client transaction is handed to consensus (not a batch)

muralisr
2016-12-06 23:37
thanks Jason

muralisr
2016-12-06 23:40
so @jyellick `but we won't ack until the transaction is guaranteed to be ordered` - would that translate to `if we ack that the tx is guaranteed to be ordered, it is guaranteed to be in a block` ?

jyellick
2016-12-07 01:41
@muralisr Short version: almost always yes. Long version: Once a transaction has been ack-ed it will have the opportunity to be included in a block. If it is 'valid' after ordering according to the raw chain ingress rules (not VSCC obviously), namely the signer is still authorized to transact on the chain, then it will be included

muralisr
2016-12-07 01:43
I think that answers my question @jyellick … but it was targeted at the batch coming out of the orderer and not the block finally created… let me rephrase: `if we ack that the tx is guaranteed to be ordered, it is guaranteed to be in the batch delivered by orderer` ?

muralisr
2016-12-07 01:44
I think what you are saying is "always, yes"

jyellick
2016-12-07 01:45
Not quite, there is prefiltering the orderer does, to make sure only authorized users submit transactions to the raw ledger, which can cause it to drop transactions

muralisr
2016-12-07 01:45
ok

jyellick
2016-12-07 01:46
Before consensus, prefiltering is not deterministic

muralisr
2016-12-07 01:46
understood

muralisr
2016-12-07 01:46
basically the ack is "submitted for ordering"

jyellick
2016-12-07 01:46
So, the ack is a best effort pre filtering approval

jyellick
2016-12-07 01:46
Right. It's been submitted for ordering and it's valid according to the current config

muralisr
2016-12-07 01:47
got it

muralisr
2016-12-07 01:47
thanks much!

jyellick
2016-12-07 01:47
Happy to help!

kostas
2016-12-07 03:02
Not sure I agree with the ACK discussion. (Assuming that by ACK we refer to the SUCCESS `BroadcastResponse` that the ordering service sends back.)

kostas
2016-12-07 03:03
> In general, once a transaction is ack-ed, it should be "in consensus"


kostas
2016-12-07 03:05
I had raised the following point during the review of the changeset that introduced the common broadcaster: https://gerrit.l.org/r/#/c/2763/3/orderer/common/broadcast/broadcast.go@141

kostas
2016-12-07 03:08
We adopted the exitChan modification proposed there, but the problem (as my comment noted) remains, no?

kostas
2016-12-07 03:09
Long story short, we're using a buffered channel as a queue for incoming messages.

kostas
2016-12-07 03:09
Whenever we can successfully inject an incoming tx to this queue, we send back an ACK.

kostas
2016-12-07 03:11
But we may queue up several messages (and send several ACKs) before the first `Enqueue()` call fails. (And the broadcaster then returns, as it should.)

kostas
2016-12-07 03:14
So we may ACK and the TX may be dropped and it *won't be* because of a filtering failure, as the discussion seems to suggest.
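
The failure mode kostas describes can be reproduced in a few lines of Go. This is a minimal sketch with illustrative names, not the broadcaster's actual code: a non-blocking send into a buffered channel succeeds, an ACK goes back to the client, yet nothing guarantees the buffered tx is ever consumed.

```go
package main

import "fmt"

// ackOnEnqueue mimics the broadcaster pattern under discussion: an ACK is
// sent back as soon as the tx lands in a buffered channel, not when it is
// actually ordered.
func ackOnEnqueue(queue chan string, tx string) bool {
	select {
	case queue <- tx:
		return true // ACKed: but the tx is only buffered, not ordered
	default:
		return false // queue full: the caller learns about the failure
	}
}

func main() {
	queue := make(chan string, 2)
	acked := 0
	for _, tx := range []string{"tx1", "tx2", "tx3"} {
		if ackOnEnqueue(queue, tx) {
			acked++
		}
	}
	// Two txs were ACKed, yet if the consumer fails before draining the
	// channel, both are lost despite the positive responses.
	fmt.Println("acked:", acked)
}
```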

vukolic
2016-12-07 09:46
@jyellick - what is blockcutter doing when the batch is filled but is queuing as the previous batch is still being processed?

vukolic
2016-12-07 09:47
and related to that - @kostas - does Kafka ordering service pipelines batches/blocks or not?

kostas
2016-12-07 12:44
@vukolic: It doesn't. Ordering and then applying config and/or writing to the ledger happens in the same thread sans buffers in between. https://github.com/kchristidis/fabric/blob/fab-819-preview/orderer/kafka/main.go#L214

kostas
2016-12-07 12:45
If this turns out to be a bottleneck, pushing the batches to a queue and having a separate goroutine handle the `WriteBlock()` call would allow for some parallelization, but seems a bit premature to optimize at this point.
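
The parallelization kostas sketches could look like the following. `writeBlock` is a hypothetical callback standing in for the orderer's real ledger API: the ordering thread enqueues cut batches and a dedicated goroutine drains them.

```go
package main

import (
	"fmt"
	"sync"
)

// Illustrative only: decouple ordering from ledger writes by handing cut
// batches to a dedicated writer goroutine.
type batch []string

func startWriter(in <-chan batch, writeBlock func(batch)) *sync.WaitGroup {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for b := range in {
			writeBlock(b) // the ordering thread is free while this runs
		}
	}()
	return &wg
}

func main() {
	written := 0
	ch := make(chan batch, 8) // small buffer between ordering and writing
	wg := startWriter(ch, func(b batch) { written += len(b) })

	// The "ordering" side just enqueues and moves on.
	ch <- batch{"tx1", "tx2"}
	ch <- batch{"tx3"}
	close(ch)
	wg.Wait()
	fmt.Println("txs written:", written)
}
```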

vukolic
2016-12-07 12:46
Ok - that said we'll need it eventually

kostas
2016-12-07 12:46
Understood. (And agreed.)

vukolic
2016-12-07 12:46
Now back to blockcutter

vukolic
2016-12-07 12:46
If the queue is growing due to the absence of pipelining

vukolic
2016-12-07 12:47
Does blockcutter cut the queue into say 25 queued blocks

vukolic
2016-12-07 12:47
Or does it let the block consisting of queued tx grow to be one big block?

kostas
2016-12-07 12:49
It would be the latter but at any rate the block wouldn't exceed `batchSize` messages at a time.


kostas
2016-12-07 12:50
It exposes an `Ordered()` and `Cut()` method. As a consensus plugin you invoke the first one sequentially to get the transactions ordered. You may additionally have to invoke the `Cut()` method if, say, your batchTimer has expired and you want to _force_ the cutting of a block.
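
A toy version of that contract, simplified from fabric's actual blockcutter, might look like this:

```go
package main

import "fmt"

// Minimal sketch of the blockcutter contract described above: Ordered()
// accumulates txs and returns a full batch once batchSize is reached; Cut()
// force-cuts whatever is pending (e.g. when a batch timer fires).
type cutter struct {
	batchSize int
	pending   []string
}

func (c *cutter) Ordered(tx string) ([]string, bool) {
	c.pending = append(c.pending, tx)
	if len(c.pending) >= c.batchSize {
		return c.Cut(), true
	}
	return nil, false
}

func (c *cutter) Cut() []string {
	batch := c.pending
	c.pending = nil
	return batch
}

func main() {
	c := &cutter{batchSize: 2}
	for _, tx := range []string{"tx1", "tx2", "tx3"} {
		if batch, cut := c.Ordered(tx); cut {
			fmt.Println("batch:", batch)
		}
	}
	// batchTimer expired: force the remainder out.
	fmt.Println("forced:", c.Cut())
}
```

A consensus plugin calls `Ordered` sequentially on the agreed tx stream and `Cut` when its timer logic says so, which is why all replicas that replay the same stream cut identical blocks.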

vukolic
2016-12-07 12:51
Hm so kafka orders tx and then

vukolic
2016-12-07 12:51
After that

vukolic
2016-12-07 12:52
The block is formed (cut)?

kostas
2016-12-07 12:53
So, in the two existing implementations (solo and kafka), when the batch is filled you don't currently have any queueing in the blockcutter going on. For instance in solo, it's the same thread that invokes the blockcutter and then does the processing, all sequentially sans buffers in between. (Same for Kafka). https://gerrit.hyperledger.org/r/gitweb?p=fabric.git;a=blob;f=orderer/solo/consensus.go;h=e7d79e8cc458448a08ba4258da6649584520a58d;hb=refs/heads/master#l96

kostas
2016-12-07 12:53
Yes.

vukolic
2016-12-07 12:54
o-k

vukolic
2016-12-07 12:54
so this would be different in pbft where batching happens before ordering

vukolic
2016-12-07 12:54
(at the leader)

vukolic
2016-12-07 12:55
do you have some doc around impl of kafka orderer?

vukolic
2016-12-07 12:56
or should I browse the code :stuck_out_tongue:


vukolic
2016-12-07 12:57
ok very good so you thought about this

vukolic
2016-12-07 12:58
now my question for cut-then-order

vukolic
2016-12-07 12:58
if we have fixed size batches/blocks this will drown throughput

vukolic
2016-12-07 12:58
how sbft works currently is that all pending tx form one big block

vukolic
2016-12-07 12:58
and if this is beyond MaxBlockSize

vukolic
2016-12-07 12:58
so be it

kostas
2016-12-07 12:59
RE: Kafka documentation. I have a high-level overview about why the Kafka orderer is designed the way it is https://docs.google.com/document/d/1vNMaM7XhOlu9tB_10dKnlrhy5d7b1u8lSY8a-kVjCO4/edit but not sure if it's the low-level detail that you want. (Code's quite straightforward as well though: https://github.com/kchristidis/fabric/commit/f9006f4c997dbbc8ae5a8f6e1b45fbf1cb3afffa#diff-cfa1d868ef1cd7b2f466aca2b058e752R173)

vukolic
2016-12-07 13:00
ok - your order and cut explanation is for the moment sufficient

vukolic
2016-12-07 13:01
but for cut-then-order the current fixed size blockcutter won't really fly until we have pipelining in sbft

vukolic
2016-12-07 13:01
otherwise the performance will suck big time

kostas
2016-12-07 13:01
I see your point - makes sense.

yacovm
2016-12-07 13:46
@kostas I saw that the JIRA issue regarding gRPC TLS pinning has been marked *Done*. Did you guys implement in the kafka shim something similar to what simon implemented in the connection.Manager?

kostas
2016-12-07 13:51
@yacovm: There's no option to delete an issue (more appropriately: there is, but isn't exposed to us). "Done" is me clearing out the issues under the "consensus" component that we are no longer working on. As far as I can guess, this will be captured by Gari's common gRPC server work.

yacovm
2016-12-07 13:52
So he's also going to implement the pinning?

yacovm
2016-12-07 13:52
I didn't see that in the JIRA issue

kostas
2016-12-07 13:53
I do not know, and I could be wrong. I have a backlog of issues I need to double-check on, and this is one of them.


yacovm
2016-12-07 13:54
I see...

c0rwin
2016-12-07 13:57
@claytonsims is there a way to add to the JIRA workflow statuses like: “Invalid”, “Won’t fix”?

yacovm
2016-12-07 13:57
Maybe ask the guys in #ci-pipeline to manually override the status of the issue? or give people permissions to do so?

kostas
2016-12-07 13:58
(Also, from what I can tell by reading the issue, it was about _investigating_ pinning, so no concrete deliverable was tied to it.)

kostas
2016-12-07 13:58
`WONTFIX` would be lovely.

yacovm
2016-12-07 13:58
hey I was just curious, that's all :wink:

kostas
2016-12-07 13:59
No problem, you did well and you reminded me to check with FAB-1255 as well.

yacovm
2016-12-07 13:59
I also don't think that it's correct to implement certificate pinning inside a consensus plugin (I mean- simon did that in the sbft because it has multiple nodes, but maybe in the future we'll have more types of these, and this sounds too cohesive with security to be in the sbft package anyway)

kostas
2016-12-07 14:04
Why do I have glimpses of a conversation here a couple of weeks ago that treated pinning as a done deal? I remember pointing out the issues that Keith had mentioned in https://jira.hyperledger.org/browse/FAB-708?focusedCommentId=19397&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-19397

yacovm
2016-12-07 14:05
```
So is the issue that go grpc doesn't provide a way to enable mutual TLS and to extract the client cert on the server?
I see the following for doing mutual TLS https://github.com/grpc/grpc-go/issues/403 but doesn't give you access to the client cert.
I see this discussed (at length) in https://github.com/grpc/grpc-go/issues/111.
It seems to indicate that we need to implement our own TransportAuthenticator but is not clear.
```
?

yacovm
2016-12-07 14:06
you mean this?

kostas
2016-12-07 14:06
That's what I had pointed out when it was brought up a couple of weeks ago. (I am aware of the fix that you pushed later on, and I remember our conversation later that day.)

yacovm
2016-12-07 14:07
so, if you replay the memory forward you'll remember that I said it can be done :thumbsup_all:

kostas
2016-12-07 14:08
Correct.

yacovm
2016-12-07 14:09
I actually have a commit pending that extracts the certificate from both ends: https://gerrit.hyperledger.org/r/#/c/2841/ for the gossip mutual TLS
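
Once the peer certificate is extractable as in that change set, the pinning check itself is small. A generic sketch, not fabric code: compare the SHA-256 fingerprint of the presented DER bytes against the expected pin.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Once the peer's certificate is accessible (e.g. via the TLS connection
// state), pinning reduces to comparing its fingerprint against a known value.
func fingerprint(derCert []byte) string {
	sum := sha256.Sum256(derCert)
	return hex.EncodeToString(sum[:])
}

func pinMatches(derCert []byte, expectedHex string) bool {
	return fingerprint(derCert) == expectedHex
}

func main() {
	cert := []byte("dummy DER bytes for illustration")
	pin := fingerprint(cert)
	fmt.Println(pinMatches(cert, pin))            // the cert matches its own pin
	fmt.Println(pinMatches([]byte("other"), pin)) // a different cert does not
}
```

The interesting part, as noted below, is not this comparison but deciding who manages the set of expected pins, which is why it smells like an identity/crypto concern rather than a consensus-plugin one.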

kostas
2016-12-07 14:09
Does this seem like something that should be included in the FAB-1255 work, or are we overstepping our boundaries here?

yacovm
2016-12-07 14:09
does what seem?

kostas
2016-12-07 14:10
TLS pinning.

yacovm
2016-12-07 14:11
oh. well, I dunno. I'd ask Gari, if he's planning to do something like that in his gRPC server, I didn't see that in the sub-tasks though... But, isn't TLS pinning really related to identity though? (you need to know who are the servers you pin their cert)

yacovm
2016-12-07 14:11
I mean- shouldn't the crypto guys be involved with this somehow?

kostas
2016-12-07 14:12
(Outside my domain, so I'll defer to others for the right call.)

jyellick
2016-12-07 15:14
https://hyperledgerproject.slack.com/archives/fabric-consensus-dev/p1481079923000359 @kostas @muralisr This comment is absolutely true, sorry about that. I had your comment from that changeset in mind, thinking that we should really delay the ack until after `Enqueue` returns successfully, which would give us the desired 'in consensus' behavior. So yes, today, an ACK only means that the shim has it, not that it is in consensus. A failure to send to Kafka for instance could absolutely cause the transaction to be lost.

muralisr
2016-12-07 15:20
makes sense .. just wanted t know where the boundaries were.. thanks Jason, Kostas

jyellick
2016-12-07 15:29
https://hyperledgerproject.slack.com/archives/fabric-consensus-dev/p1481115660000410 You may choose to cut a block at any time by invoking `CutBlock`. In solo this is done via a simple timer, but in actual distributed consensus, the decision on where to invoke `CutBlock` must be consented upon. In the Kafka case, the last I heard was to have the shims send a special 'cutblock' meta-message for a particular block when a timer expires, and the first message to arrive wins. In the SBFT case I think this is actually easier, as the leader may do the simple timer logic, and the backups merely need to replicate his behavior by passing the transactions in order through the blockcutter and invoking `CutBlock` themselves. Note that in order to support configuration on the chain, some transactions now modify 'state', and this is going to complicate sbft noticeably, especially supporting pipelining. It should not be insurmountable, and should generally not affect performance, as the state-modifying transactions should be extremely infrequent. In short, 'normal' transactions may be ordered in a pipelined fashion, but a reconfiguration transaction must be executed before any additional transactions may be ordered.

jyellick
2016-12-07 15:29
@vukolic ^

vukolic
2016-12-07 15:31
What reconfiguration do we plan to support for v1?

scottz
2016-12-07 15:34
@jyellick Can we expect that an event notification be raised for every transaction, ordered or not, including failures such as these lost messages (especially since we had already ack'd them)?

jyellick
2016-12-07 15:36
@vukolic The biggest piece of reconfiguration is chain membership, ie who is allowed to transact on the chain.

vukolic
2016-12-07 15:37
Client/peer membership or orderer membership?

jyellick
2016-12-07 15:38
Client/peer membership is the must have. There's also the requirement to support chain creation which is a variation on that theme.

jyellick
2016-12-07 15:38
For Kafka, we will support orderer membership changes, but for SBFT I think we could say orderer membership changes are not allowed.

vukolic
2016-12-07 15:38
Chain creation seems irrelevant for pbft

jyellick
2016-12-07 15:39
Why is that?

vukolic
2016-12-07 15:39
It is used only for confidentiality and that does not make much sense in the Byz model

vukolic
2016-12-07 15:39
As Byz orderer can leak over channels as it pleases

jyellick
2016-12-07 15:40
Yes, I agree the use case is a little fuzzy for multi-chain and byzantine. But, it comes essentially 'for free', so I don't see any reason to explicitly disallow it.

vukolic
2016-12-07 15:41
It is not really for free

vukolic
2016-12-07 15:41
It requires a lot of accounting

vukolic
2016-12-07 15:41
Unless somebody else takes care of this

jyellick
2016-12-07 15:41
Right, this is a common component that sbft does not need to handle

vukolic
2016-12-07 15:41
And pbft just forwards metadata around

vukolic
2016-12-07 15:41
Effectively multiplexing channels

jyellick
2016-12-07 15:43
There's nothing which prevents the consensus algorithm from multiplexing channels if it so desires. However, I do think that would be a lot of work. The easiest way to handle it would simply be to run an instance of sbft per channel.

jyellick
2016-12-07 15:43
@scottz https://hyperledgerproject.slack.com/archives/fabric-consensus-dev/p1481124853000459 What event are you looking for? The only events I'm aware of are events encoded in transactions, which are processed when the transaction is committed to the chain.

kostas
2016-12-07 15:45
(And to that question, I would also add: Where would you like to see that notification raised? Locally on the orderer, or relayed to the client somehow?)

vukolic
2016-12-07 15:47
If we have channels in bft for performance reasons (not to order everything on a single chain) then channels make more sense

vukolic
2016-12-07 15:47
But I see prevalent mention of multi-chain for confidentiality, and there it does not make sense

vukolic
2016-12-07 15:48
In the bft case

jyellick
2016-12-07 15:49
Totally agree, BFT + Multichain Confidentiality does not make sense.

adc
2016-12-07 15:49
+1

scottz
2016-12-07 16:00
The client should be able to get success and failure notifications for every transaction. The success case is when the transaction is written to the ledger. Failure cases abound everywhere, in the orderer system and in peers, wherever transactions may get dropped or determined invalid. The picture in my mind says when the transaction object is destroyed (after dropping from a queue, or more simply inside error-checking code after being determined to have the wrong signatures, or whenever) it could raise a failure event - and yes, that must be forwarded to the client.

scottz
2016-12-07 16:02
The ACK you mentioned signifies simply being submitted for ordering. We need to know whether it gets written to ledger, or not.

kostas
2016-12-07 16:02
Right, that goes back to our discussion earlier. This piece of code will need some reworking.

kostas
2016-12-07 16:04
And to play devil's advocate here for a second, couldn't you argue that you know whether it gets written to the ledger by actually reading the ledger? (Via a Deliver call.)

scottz
2016-12-07 16:05
ah, and how long should I wait before reading the ledger? sounds like v0.5 again.

scottz
2016-12-07 16:07
how long is "long enough that it should have been written by now" ?

scottz
2016-12-07 16:07
so, no, that logic is not a usable solution...

kostas
2016-12-07 16:08
Wonderful. As you'll see from last night's conversation, I have argued for a proper ACK mechanism.

jyellick
2016-12-07 16:10
@scottz I'm still not sure how/where you would like to get events? The client may send a message and then disconnect. Where do events go? How long are they persisted? Who is authorized to read them?

jyellick
2016-12-07 16:11
@kostas I will try to implement that proper ack today

scottz
2016-12-07 16:53
Hmm. So do we agree that it needs to be done? Your questions are about requirements and implementation. I admit I am not sure exactly what the behavior should be or what is feasible. I believe Jim Zhang and his sdk dev team are implementing the event system. I expect that test/application code should be able to register for events either per-transaction or per-transaction-type, or for all failures or all successes or who knows what. (I do not know the details of the planned API or the implementation, and whether they are using the event-listener or something else. Does the SDK attach to the Tx a list of interested parties to be notified? Or is there a defined port used? I guess I should get talking to them.)

scottz
2016-12-07 16:53
Nevertheless, whatever is implemented in SDK will depend on the rest of fabric code being able to handle transactions reliably through to completion - whether successfully written to ledger, or graceful failures due to error checks or validation etc, or even ungraceful failure such as dropped msg.

jyellick
2016-12-07 16:56
Ah, so this is about the SDK, I had assumed it was about the orderer. That makes more sense. I will fix the orderer so that it only ACKs after the message is truly 'in consensus'. However, if the system is reconfigured between transaction ingress and transaction egress (into a block), I'm not sure of any course of action to take other than to log it and 'throw it away'.

jyellick
2016-12-07 16:57
This should be an exceedingly rare event though, as it's only an issue during chain reconfiguration.

scottz
2016-12-07 17:13
can you clarify "chain reconfiguration" use-cases that hit this window? Is it specifically when a chain-config transaction to remove a user (or peer?) is closely preceded by a transaction removing that particular user or peer? Or other use cases?

jyellick
2016-12-07 18:15
@scottz Maybe a good example would be a certificate which gets revoked.
1. Admin revokes a certificate, and pushes a configuration transaction with this revocation
2. Orderer pre-validates the config transaction and sends it to be ordered
3. User submits a transaction signed with the revoked certificate
4. Orderer pre-validates the transaction (returning a `SUCCESS` status) and sends it to be ordered, as the configuration has not been applied yet
5. Orderer has ordered the config transaction, and applies the new configuration, revoking that cert
6. Orderer has ordered the user transaction and does final validation before including it in a block, realizes this transaction is no longer valid, so the orderer logs the anomaly (in general, pre-filtering and final validation should always return the same result, unless the config changes in between) and discards the transaction
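
The scenario above boils down to the same validity check running twice against a config that changed in between. A minimal illustration with hypothetical types, not fabric's:

```go
package main

import "fmt"

// Sketch of the two validation points described above: the same check runs
// at ingress (pre-filter) and again just before the tx enters a block. A
// config change landing in between makes the second check fail even though
// the first returned SUCCESS.
type config struct {
	revoked map[string]bool
}

func (c *config) valid(signer string) bool { return !c.revoked[signer] }

func main() {
	cfg := &config{revoked: map[string]bool{}}

	// Pre-validation passes; the client sees SUCCESS.
	fmt.Println("pre-filter:", cfg.valid("userX"))

	// The ordered config tx applies, revoking the cert.
	cfg.revoked["userX"] = true

	// Final validation fails; the tx is logged and discarded.
	fmt.Println("final:", cfg.valid("userX"))
}
```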

scottz
2016-12-07 18:56
@jyellick Right; that seems like a better description of the scenario I was trying to describe. "Revoking privileges of userX at the same time that userX proposes a transaction" is a direct-impact specific use-case that I can accept, because we are trying to stop that user's transactions anyway. I guess what I am asking is: is there something potentially more common or wide-impacting? Something with indirect impact, as a result of changing something that is referenced from the transaction payload header? For example, could we lose all queued pre-validated transactions on a chain if that chain's policy is modified? I don't think so, unless the chainID actually changes somehow when you reconfigure the chain ...

jyellick
2016-12-07 19:30
@scottz It's certainly possible to construct a policy change which could cause all queued transactions to be discarded, for instance, set the authorized writers to the empty set. Though I don't know why anyone would do that. The only thing that comes to mind as a possibly more common case is if a transaction is detected as a replay attack because it is in the wrong epoch. It's possible that a transaction enters the system during one epoch, and the epoch advances before the transaction is ordered, so it is no longer valid when it comes time to apply. Assuming the SDK sets the epoch correctly, this hopefully won't be an issue, and as I said, we have not implemented this yet.

mmayorivera
2016-12-07 21:37
has joined #fabric-consensus-dev

mmayorivera
2016-12-07 21:39
hi there

mmayorivera
2016-12-07 21:40
can anybody give me a nice simple way to implement PBFT consensus?

kostas
2016-12-07 23:08
Simple and PBFT don't really go hand in hand. For an implementation in the previous architecture, checkout the v0.6 branch and look into `consensus/pbft`. For an implementation in the new architecture that's still a WIP, checkout `orderer/sbft` in the master branch.

mmayorivera
2016-12-08 01:18
thanks, I will

tuand
2016-12-08 15:01
scrum ...

2016-12-08 15:01
@tuand has started a Google+ Hangout for this channel. https://hangouts.google.com/hangouts/_/q54kpv7kancaxmbmcl4hicxupie.

oliverledger
2016-12-08 21:16
has joined #fabric-consensus-dev

binhn
2016-12-09 13:22
Any issues in moving orderer/util to protos/utils? I need the same functions to create test blocks on the peer side

tuand
2016-12-09 14:17
np ... i'll get a changeset going

mmayorivera
2016-12-09 21:22
does anybody know what "CAE=" in consensusMetadata means???

mmayorivera
2016-12-09 21:23
and also, any good examples of event handling in 0.6 and 1???

mmayorivera
2016-12-09 21:23
please.....

jyellick
2016-12-09 21:46
@mmayorivera This (the consensus metadata) is likely the marshaled PBFT sequence number

jyellick
2016-12-09 21:47
It is intended to be opaque to non-consensus components

jyellick
2016-12-09 21:47
(As it could vary by consensus implementation)

tuand
2016-12-12 15:00
scrum ...

kostas
2016-12-12 15:01
Link?

2016-12-12 15:01
@tuand has started a Google+ Hangout for this channel. https://hangouts.google.com/hangouts/_/2fjvq6xp6vf37g3fo3sonolvtme.


kostas
2016-12-12 23:01
So, Bishop's graph above shows the flow control that is performed by HTTP/2 (https://http2.github.io/http2-spec/#FlowControl) when pushing messages down a system that performs no flow control on its own (other than to block when its queue is full, as is the case with the Kafka orderer). Notice how the latency (between transmission and reception) settles to a steady state after a while.

kostas
2016-12-12 23:01
The argument here is that since flow control is being handled at the underlying layer, we should not worry about it on the application layer. This is w/r/t the work that's being done on the broadcast side to fix some of the issues in the common component.

kostas
2016-12-12 23:05
(@bcbrock please correct me if I'm misrepresenting things here.)

kostas
2016-12-12 23:07
I'm thinking about this and I think I'm sold. Given the graph above, do we still have good reasons for wanting to do flow control on the app layer? I may be missing something.

bcbrock
2016-12-12 23:08
One other piece of info. is needed for the proof: In this run, 200K TX were broadcast and delivered. It took between 35 and 40 seconds. If the broadcast queue is of size 200K, the broadcast of 200K TX will complete in 10 seconds. Also note that broadcast and deliver clients are separate, completely independent processes. This and the above show that the underlying protocols are handling the flow control.

vukolic
2016-12-13 00:20
if I am reading well the same would apply to (s)bft?

vukolic
2016-12-13 00:23
I am asking in the context of @chetsky out-to-lunch tests :wink: https://github.com/bft-smart/library/issues/27

vukolic
2016-12-13 00:26
@kostas @bcbrock ^^

vukolic
2016-12-13 00:27
can a Byz HTTP/2 sender somehow circumvent this flow control?

vukolic
2016-12-13 00:29
(if so this attack would be relevant to Kafka as well)

bcbrock
2016-12-13 00:32
I assume a malicious implementation of GPRC could ignore the HTTP/2 flow control. I don’t know what would happen in that case.

kostas
2016-12-13 00:39
Good point. The only reference to this on the HTTP/2 spec is on Section 10.5 (talking about the abuse of WINDOW_UPDATE). Based on what I'm reading here https://www.imperva.com/docs/Imperva_HII_HTTP2.pdf, it comes down to whether the HTTP/2 server is implemented in a way that addresses this concern.

kostas
2016-12-13 00:39
> In at least two cases we found HTTP/2 implementations that specifically failed to account for the typical traps designers warn about. See Flow Control DoS, Dependency Cycle. One example is an attack based on abuse of the flow control WINDOW_UPDATE for DoS attacks, which was specifically warned about...

kostas
2016-12-13 00:39
So then the question is whether gRPC addresses that.

kostas
2016-12-13 00:42
(And I can see an argument in favor of flow control in the app layer being that we don't want to be married to how the underlying framework serves HTTP/2 requests, but I would argue that whatever framework we choose _should_ address this.)

kostas
2016-12-13 00:42
Anyway, I'll look into gRPC.

vukolic
2016-12-13 00:43
thks

jyellick
2016-12-13 00:52
@bcbrock @kostas Which flow control mechanism was this, is this pre or post https://gerrit.hyperledger.org/r/#/c/3185/ ?

bcbrock
2016-12-13 01:16
@jyellick I am trying to argue that neither the broadcast nor deliver protocols require explicit flow control (windows). The experiment above was using Kostas' code https://github.com/kchristidis/fabric/tree/fab-819-preview

bcbrock
2016-12-13 01:17

bcbrock
2016-12-13 01:18
Where the broadcast server sends an error response if its queue overflows. Instead, we simply let the broadcast server block.
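
The "just let the broadcast server block" model can be sketched with a bounded Go channel. This is an illustration only, not the orderer's actual code: in the real system the blocked Recv loop stops reading, HTTP/2 window credits run out, and the sender is throttled end to end.

```go
package main

import (
	"fmt"
	"time"
)

// A bounded queue where the enqueueing goroutine blocks instead of
// returning an overflow error; the consumer's pace throttles the producer.
func main() {
	queue := make(chan string, 2) // small bound so blocking is visible

	// Slow consumer, standing in for the ordering backend.
	go func() {
		for tx := range queue {
			time.Sleep(10 * time.Millisecond)
			fmt.Println("ordered:", tx)
		}
	}()

	start := time.Now()
	for i := 0; i < 5; i++ {
		queue <- fmt.Sprintf("tx%d", i) // blocks once the buffer is full
	}
	elapsed := time.Since(start)
	close(queue)

	// The producer was paced by the consumer rather than erroring out.
	fmt.Println("producer throttled:", elapsed > 15*time.Millisecond)
	time.Sleep(100 * time.Millisecond) // let the consumer drain
}
```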

kostas
2016-12-13 01:27
(@jyellick: The point here being that the work in 3185 may not be necessary, and instead we move back to a simpler model, similar to the one that I wrote for the Kafka orderer originally, pre- common components)

bcbrock
2016-12-13 01:49
@kostas My reading of the Go grpc code is that it does check the window limits, and will close the connection or stream as appropriate according to which window size was violated. I’ve hacked the http2_client to behave badly and gotten the http2:ErrCodeFlowControl == codes.ResourceExhausted error (although it did not manifest how I might have expected, so I may not have hacked it correctly.)

kostas
2016-12-13 02:18
Ah, excellent. I also opened an issue on the gRPC-go repo to get confirmation of this behavior.

jyellick
2016-12-13 03:16
@bcbrock @kostas Sounds promising. Seems like it might be a clear win for broadcast, though would like to think on this a little more. On deliver, getting rid of the window size seems a little more problematic without other API changes, as the window size allows a client to do something like "retrieve blocks 3-7" without the server attempting to deliver blocks "3 until the gRPC buffer fills up". Maybe the right answer is to just modify the API to more explicitly support this though.

jyellick
2016-12-13 03:19
Since this buffer is at the HTTP2 layer, I assume that the buffer is shared for the whole stream? IE, 10x the broadcast calls does not give you 10x the window (which sounds like yet another advantage.)

jyellick
2016-12-13 03:34
Okay, my gut reaction is to kill windows in both broadcast and deliver, then modify the Deliver API to require explicit ranges (start and end). Where it would be simple to specify the end as `^uint64(0)` to receive blocks indefinitely. This seems easier for a client to implement, and assuming HTTP2/gRPC handles the windowing for us, more efficient than the API as it stands.
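
The `^uint64(0)` sentinel mentioned above is just the all-bits-set maximum uint64. A sketch of the range-based request (the `SeekRange` type here is hypothetical, not the actual Deliver API):

```go
package main

import (
	"fmt"
	"math"
)

// Hypothetical range-based Deliver request: the client asks for an explicit
// [Start, End] span of block numbers instead of negotiating a sliding
// window with the server.
type SeekRange struct {
	Start uint64
	End   uint64 // ^uint64(0) means "keep delivering indefinitely"
}

func main() {
	// ^uint64(0) flips every bit of zero, i.e. the maximum uint64.
	fmt.Println(^uint64(0) == math.MaxUint64) // true

	finite := SeekRange{Start: 3, End: 7}        // "retrieve blocks 3-7"
	tail := SeekRange{Start: 0, End: ^uint64(0)} // follow the chain forever

	fmt.Println(finite.End-finite.Start+1, "blocks requested")
	fmt.Println("open-ended:", tail.End == math.MaxUint64)
}
```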

jyellick
2016-12-13 03:34
(Will continue to think on this though)

adc
2016-12-13 08:45
Hi @tuand @jyellick @kostas, shall we start a discussion on how to integrate the MSP into the orderers? I think it is a good time now. After the discussion, I expect I can give you a way to sign messages with the default identity and have support for access control at chain creation and once a chain is created. Please let me know :slightly_smiling_face:

nits7sid
2016-12-13 12:16
Hi.. Does the ordering service use PBFT?

hgabor
2016-12-13 12:32
@nits7sid in v0.6 or in "master"?

hgabor
2016-12-13 12:34
in v0.6 we have PBFT, but also a noops orderer (as I remember, it is a 1-node orderer which orders by the time of requests). In "master" we have solo, kafka, and sbft (which is a Work In Progress)

pd93
2016-12-13 13:22
has joined #fabric-consensus-dev

karlkay
2016-12-13 14:07
has joined #fabric-consensus-dev

adc
2016-12-13 14:24
@jyellick @kostas @vukolic how is the validity of a block verified? Can you give me a pointer to the code?

vukolic
2016-12-13 14:24
What do you subsume under validity of a block?

vukolic
2016-12-13 14:26
Endorsement policy validation / vscc, or something else?

adc
2016-12-13 14:28
that it is signed by enough orderers, I guess

jyellick
2016-12-13 14:30
@adc Since blocks aren't signed yet, they aren't verified.

jyellick
2016-12-13 14:31
We'll create a policy with a name like `BlockValidator` or something like that, which can be applied to the signatures over a block to check for validity.

adc
2016-12-13 14:32
I ask, because for gossip I should provide a VerifyBlock method

adc
2016-12-13 14:32
but the implementation will mostly come from you guys, I was thinking this morning

nits7sid
2016-12-13 14:32
@hgabor master

adc
2016-12-13 14:32
okay, I will wait then. If help is needed let me know

svergara
2016-12-13 14:37
has joined #fabric-consensus-dev

jyellick
2016-12-13 15:03
@adc Yeah, I think this should be pretty simple for you, we'll encode a policy in the chain config block, you'll just pass the signature set into the policy and it will return nil or an error to let you know if the block is valid.

adc
2016-12-13 15:04
do you mean that different chains might have different block policies?

jyellick
2016-12-13 15:05
I wouldn't expect for them to out of the gate, but I would certainly plan to allow that. Especially once you consider that there's no reason a peer couldn't subscribe to multiple ordering services.

adc
2016-12-13 15:05
got it, let's start anyway from the simple scenario :slightly_smiling_face:

jyellick
2016-12-13 15:40
@kostas @bcbrock https://gerrit.hyperledger.org/r/#/c/3253/ This removes the queuing entirely from broadcast, I'll open another story and remove the windowing from Deliver as well

jyellick
2016-12-13 15:43
After some additional thought, I also realized where the idea of the queues and windowing came from. In 0.5/0.6 we were sharing a common gRPC stream between components, which meant that blocking on a gRPC call could cause the stream to fill up, and starve other components, (and lead to deadlocks). Since we have the clearer separation of concerns in v1, this should no longer be an issue.

vukolic
2016-12-13 16:11
@adc signatures in sbft can be found in checkpoint.go

vukolic
2016-12-13 16:12
and from there you could try "find all references" if your ide supports it

adc
2016-12-13 16:14
aha, great

adc
2016-12-13 16:15
I can have a look at that

adc
2016-12-13 16:15
do we expect that sbft, kafka, etc will have the same way to validate a block?

vukolic
2016-12-13 16:16
they will have a different one

vukolic
2016-12-13 16:16
but conceptually the idea should be that the peer calls something like a consensus.verifyBlock function

vukolic
2016-12-13 16:16
which needs to be implemented by every consensus implementation

vukolic
2016-12-13 16:16
some will have one signature to verify

jyellick
2016-12-13 16:16
@vukolic We have a generic mechanism for this

vukolic
2016-12-13 16:16
some f+1

vukolic
2016-12-13 16:16
some 2f+1

vukolic
2016-12-13 16:17
which one?

jyellick
2016-12-13 16:17
We have signature policies, where you can require "N out of {set of signatures}"

jyellick
2016-12-13 16:17
For the CFT case, N is 1, for the BFT case, N can be set to f+1 or 2f+1
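
The "N out of {set of signatures}" rule can be sketched in a few lines. This is a stripped-down illustration only; Fabric's real policy framework is richer (nested policies, principals) and these names are not its API:

```go
package main

import "fmt"

// satisfied implements the counting rule discussed above: at least n of the
// required identities must have produced a valid signature.
func satisfied(n int, required []string, signedBy map[string]bool) bool {
	count := 0
	for _, id := range required {
		if signedBy[id] {
			count++
		}
	}
	return count >= n
}

func main() {
	orderers := []string{"orderer0", "orderer1", "orderer2", "orderer3"}
	sigs := map[string]bool{"orderer0": true, "orderer2": true}

	// CFT: a single valid orderer signature suffices.
	fmt.Println("CFT (n=1):", satisfied(1, orderers, sigs)) // true

	// BFT with f=1: require 2f+1 = 3 signatures out of 3f+1 = 4 orderers.
	fmt.Println("BFT (n=3):", satisfied(3, orderers, sigs)) // false
}
```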

vukolic
2016-12-13 16:18
how is this different from what I mentioned? :slightly_smiling_face:

jyellick
2016-12-13 16:18
I mean simply that from a consumer's point of view, they simply get the block validation policy, and evaluate it

jyellick
2016-12-13 16:18
They should not need to worry about the implementation

adc
2016-12-13 16:18
so, to validate a block, I need to know the channel it refers to and then I can apply the policy to the block, correct?

adc
2016-12-13 16:18
The block itself does not tell me to which channel it belongs to, correct?

jyellick
2016-12-13 16:18
The block will tell you which channel it belongs to

vukolic
2016-12-13 16:19
you seem to be simplifying the implementation of the policy for a consensus implementation

vukolic
2016-12-13 16:19
that is fine

jyellick
2016-12-13 16:19
Because all blocks are non-empty

vukolic
2016-12-13 16:19
but is still a different implementation

jyellick
2016-12-13 16:19
And all transactions contain a chainID

adc
2016-12-13 16:19
I'm asking because gossip needs that

jyellick
2016-12-13 16:19
So, you may simply open up the first transaction and check its chainID, there is a utility method for this in `protos/utils` I believe
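
The trick described here can be sketched as follows. The types and the helper name are illustrative stand-ins, since the real protobuf types and the utility under `protos/utils` are only referenced, not shown:

```go
package main

import "fmt"

// Illustrative stand-ins for the real protobuf types.
type Transaction struct{ ChainID string }
type Block struct{ Transactions []Transaction }

// chainIDOf mirrors the idea above: since every block is non-empty and every
// transaction carries a chainID, a block's channel can be read off its first
// transaction.
func chainIDOf(b *Block) (string, error) {
	if len(b.Transactions) == 0 {
		return "", fmt.Errorf("block has no transactions")
	}
	return b.Transactions[0].ChainID, nil
}

func main() {
	b := &Block{Transactions: []Transaction{{ChainID: "mychannel"}}}
	id, err := chainIDOf(b)
	fmt.Println(id, err) // mychannel <nil>
}
```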

vukolic
2016-12-13 16:19
does this policy-oriented implementation account for threshold sigs?

adc
2016-12-13 16:20
great

jyellick
2016-12-13 16:21
@vukolic This is a better question for @adc / @elli, but my understanding is that it should be adaptable. The `Policy` is also an extensible type; today the types are `oneof { SignaturePolicy }`, but we can add other policy types as well.

adc
2016-12-13 16:22
to my understanding, Jason's policy framework is generic enough to also handle threshold sigs

adc
2016-12-13 16:22
I like it :slightly_smiling_face:

vukolic
2016-12-13 16:22
I do not see threshold coming for v1 but certainly sth that might be possible down the road

vukolic
2016-12-13 16:22
Jason, we need to "sit down" and open a set of issues for merging sbft with common

vukolic
2016-12-13 16:22
the code is about to be ready for this...

vukolic
2016-12-13 16:23
by the way I am not a fan of the name sbft

vukolic
2016-12-13 16:23
it is really spbft

vukolic
2016-12-13 16:23
if s is to be kept at all

vukolic
2016-12-13 16:24
anyway I could use a walk through common to understand all the merge points

jyellick
2016-12-13 16:36
@vukolic Yes, would be happy to do this, would like to get the full flow of solo/kafka working first, so that we don't have to communicate changes. I think we are close though.

bcbrock
2016-12-13 17:39
@jyellick @kostas Guys, I apologize for any misunderstanding; I don't believe I said that removing explicit flow control would improve performance, only that it would make things simpler. Also, I am confused about https://gerrit.hyperledger.org/r/#/c/3253/ : if applied to the current master branch, it will not affect the Kafka orderer. Am I supposed to try this against one of Kostas' private branches? If so, which one?

vukolic
2016-12-13 17:41
intuitively, performance cannot be worse

vukolic
2016-12-13 17:41
and likely would be better

vukolic
2016-12-13 17:42
so Byz HTTP/2 sender messing up with flow control is ruled out?

kostas
2016-12-13 17:45
So testing this properly will need some cherry-picking.

jyellick
2016-12-13 17:46
@bcbrock Simpler with no performance penalty is still a win in my book, and yes, you'll need to wait to test 3253 and Kafka or apply them both manually

kostas
2016-12-13 17:46
But we can make it straightforward. Let me finish rebasing this: https://gerrit.hyperledger.org/r/#/c/3207/

kostas
2016-12-13 17:46
Then you checkout 3207 and cherry-pick 3253 on top of it.

jyellick
2016-12-13 17:47
For what it's worth, I'm halfway through removing windowing from the deliver code and it is a very significant complexity reduction, with reduced locking as well, which (by intuition only) should improve performance.

kostas
2016-12-13 17:47
But I can always do this myself and push to a private repo. In fact, that's what I'll do.

kostas
2016-12-13 17:48
I'll need an hour or so.

kostas
2016-12-13 17:48
I'll post the link here.

bcbrock
2016-12-13 17:48
@vukolic So far no one has answered Kostas’ question (about whether Go grpc guarantees to reject windowing violations)

bcbrock
2016-12-13 17:48
@kostas Thanks

kostas
2016-12-13 17:49

vukolic
2016-12-13 17:53
ok I am watching that one now

vukolic
2016-12-13 18:06
so if the RFC is respected, then a gRPC send may block at the sender?

bcbrock
2016-12-13 18:07
Yes. We see that in the latency chart from yesterday. The sender is throttled by the ability of the receiver to process the data.

vukolic
2016-12-13 18:14
thks

vukolic
2016-12-13 18:15
the spirit of the RFC seems to contain a Byz sender

vukolic
2016-12-13 18:15
but lets wait for the answer


vukolic
2016-12-13 18:32
i wonder how grpc stands with that "slow read" attack

kostas
2016-12-13 20:39

kostas
2016-12-13 20:41
It does take care of it. Besides the links that Menghan included, also check this: https://github.com/grpc/grpc-go/blob/master/transport/control.go#L158

vukolic
2016-12-13 20:52
thanks Kostas

vukolic
2016-12-13 20:52
if any maintainers here I have this https://gerrit.hyperledger.org/r/#/c/3273/

muralisr
2016-12-13 20:58
I don’t dare review this @vukolic … @jyellick you around ? :slightly_smiling_face:

muralisr
2016-12-13 20:59
or @hgabor ?

vukolic
2016-12-13 21:01
@muralisr why? :slightly_smiling_face: there is a nice test! :slightly_smiling_face:

muralisr
2016-12-13 21:01
ok, if you put it that way… let me look :slightly_smiling_face:

vukolic
2016-12-13 21:01
inspired by "real world", live testing trace

vukolic
2016-12-13 21:02
it's amazing how much interleaving can be produced by a non-deterministic test

vukolic
2016-12-13 21:03
the trace interleaving was such that it delayed all connections to the primary, which starts alone and then complains about itself - but this leads to a deadlock

jyellick
2016-12-13 21:03
@muralisr I can try to take a look in a few, currently running down a different bug

vukolic
2016-12-13 21:03
so much fun testing this stuff

muralisr
2016-12-13 21:03
thanks Jason

muralisr
2016-12-13 21:03
I’ll look but won’t touch then :slightly_smiling_face:

muralisr
2016-12-13 21:04
@vukolic has made it sound fun :slightly_smiling_face:

vukolic
2016-12-13 21:04
what can I do - you all ran away from this

vukolic
2016-12-13 21:04
leaving me alone

vukolic
2016-12-13 21:04
with notable exception of @hgabor

vukolic
2016-12-13 21:04
so I need to make it sound like fun so somebody joins back

hgabor
2016-12-13 21:05
I will review everything- but only tomorrow :P as 10 PM here

hgabor
2016-12-13 21:05
but it is really fun

muralisr
2016-12-13 21:08
`so I need to make it sound like fun so somebody joins back` … one of these days I hope !

joshhus
2016-12-13 22:00
has joined #fabric-consensus-dev

joshhus
2016-12-13 22:02
Hello, is SBFT Simplified BFT? ... Google reveals a "Scalable" BFT already in the literature. HL general indicates that SPBFT is being discussed. Thanks.

nage
2016-12-13 22:07
has joined #fabric-consensus-dev

vukolic
2016-12-13 22:17
@joshhus yes sbft is simplified pbft

vukolic
2016-12-13 22:17
it may change name to simplepbft, spbft, or something else

vukolic
2016-12-13 22:17
sbft is not yet carved in stone

vukolic
2016-12-13 22:18

jeno.gocho
2016-12-13 23:27
has joined #fabric-consensus-dev

baohua
2016-12-14 09:14
has joined #fabric-consensus-dev

baohua
2016-12-14 09:15
yes, spbft may be better

baohua
2016-12-14 09:16
and is there any discussion on migrating the orderer service outside of the fabric code, into a separate repo (fabric-orderer), like fabric-cop?

hgabor
2016-12-14 09:57
maybe it would be a good idea to have a separate repo for that code

vukolic
2016-12-14 11:44
due to the popular request

vukolic
2016-12-14 11:44
differences of (what is currently) sbft wrt pbft


simon
2016-12-14 11:55
hi

simon
2016-12-14 11:56
good bug fixing on sbft

vukolic
2016-12-14 12:05
oh well hello

vukolic
2016-12-14 12:05
coming back?

vukolic
2016-12-14 12:05
:wink:

simon
2016-12-14 12:14
jira just sent me a message for a commit

simon
2016-12-14 12:15
i don't understand the message reordering thing

simon
2016-12-14 12:15
messages from every replica should be processed in order, no?

vukolic
2016-12-14 12:19
I did not have a chance to dive into that - but experiments say no

vukolic
2016-12-14 12:19
I think the bigger dependency of the codebase is that we expect reliable delivery of msgs so long as the connection is up

vukolic
2016-12-14 12:19
which is true

vukolic
2016-12-14 12:19
and simplifies code

vukolic
2016-12-14 12:19
reordering is not so much an issue

yacovm
2016-12-14 12:30
How can you expect reliable delivery of msgs as long as the connection is up? From what I know, gRPC's `Send()` method returns when the message is put in the buffers, not when it reaches the other side. (I hope I'm not asking out of context here)

vukolic
2016-12-14 12:31
I am assuming msg would be delivered from the buffer - so long as the connection is up

vukolic
2016-12-14 12:31
(unless the receiver is faulty)

vukolic
2016-12-14 12:32
I mean we need some feature from TCP, HTTP/2

vukolic
2016-12-14 12:32
I think this is the one we get

yacovm
2016-12-14 12:32
oh, I see. I thought you were assuming that message is delivered if the `Send` didn't return an error

vukolic
2016-12-14 12:33
@simon reordering first appeared across TCP connections, where it is expected to happen

vukolic
2016-12-14 12:33
say primary sends me pre prepare

vukolic
2016-12-14 12:33
but by the time I receive that

vukolic
2016-12-14 12:33
I already received all the prepares and commits from others

simon
2016-12-14 12:33
yes

vukolic
2016-12-14 12:33
this was the source of one bug and has nothing to do with grpc FIFO

vukolic
2016-12-14 12:34
in some cases also single channel FIFO failed

simon
2016-12-14 12:34
yes, that's why we have the backlog, so that we can compensate for different arrival times
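
The backlog idea can be sketched as follows. This is a toy in the spirit of sbft's backlog, heavily simplified (single sequence space, no per-replica backlogs); the names are illustrative, not sbft's:

```go
package main

import "fmt"

// Protocol messages that arrive before the pre-prepare they depend on are
// parked, then replayed once it lands.
type Msg struct {
	Seq  uint64
	Kind string // "preprepare", "prepare" or "commit"
}

type Replica struct {
	prePrepared map[uint64]bool
	backlog     []Msg
	applied     []string
}

func (r *Replica) Receive(m Msg) {
	if m.Kind == "preprepare" {
		r.prePrepared[m.Seq] = true
		r.apply(m)
		// Replay messages that arrived early, now that their pre-prepare is here.
		pending := r.backlog
		r.backlog = nil
		for _, p := range pending {
			r.Receive(p)
		}
		return
	}
	if !r.prePrepared[m.Seq] {
		r.backlog = append(r.backlog, m) // out-of-order arrival: park it
		return
	}
	r.apply(m)
}

func (r *Replica) apply(m Msg) {
	r.applied = append(r.applied, fmt.Sprintf("%s@%d", m.Kind, m.Seq))
}

func main() {
	r := &Replica{prePrepared: map[uint64]bool{}}
	// Prepares and commits from fast replicas beat the primary's pre-prepare.
	r.Receive(Msg{Seq: 1, Kind: "prepare"})
	r.Receive(Msg{Seq: 1, Kind: "commit"})
	r.Receive(Msg{Seq: 1, Kind: "preprepare"})
	fmt.Println(r.applied) // [preprepare@1 prepare@1 commit@1]
}
```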

vukolic
2016-12-14 12:34
but I am not sure if that is not due to some re-ordering on the recipient side

vukolic
2016-12-14 12:34
after grpc

vukolic
2016-12-14 12:34
will need to look at that eventually

simon
2016-12-14 12:35
okay

adc
2016-12-14 14:51
Hi @tuand, is there a scrum call today?

tuand
2016-12-14 14:52
hey angelo monday/Thursday 10AM eastern

adc
2016-12-14 14:52
okay, I would like to start talking about integration of the MSP

adc
2016-12-14 14:52
we can do that during the scrum or earlier if you think so

adc
2016-12-14 14:53
I see that there are other change-set waiting for the merging

tuand
2016-12-14 14:53
yes, many changesets right now :slightly_smiling_face: let's do it right after the scrum ?

adc
2016-12-14 14:54
yes, please :slightly_smiling_face:

adc
2016-12-14 14:54
thanks

nits7sid
2016-12-14 16:54
in SBFT, how is the ordering done?

kostas
2016-12-14 17:06
Heads up @hgabor: `TestMonotonicViews` is causing the master to fail. (`simplebft_test.go:200: Replica 0 must be in view 1, but is in view 2`)

kostas
2016-12-14 17:08
When tested locally, the test itself passes. When tested locally along with all the other tests in the `sbft` package, the test fails. I'm guessing (a) there's either an artifact from a previous test that affects the outcome of this one, or (b) the timing constraints are too tight.

hgabor
2016-12-14 17:57
Kostas: I can only look at it tomorrow if not a big problem

hgabor
2016-12-14 17:58
I guess "tests ran too long" is the problem

hgabor
2016-12-14 17:59
@vukolic any comments? :)

hgabor
2016-12-14 18:00
please add skip to those tests if they block anything urgent

hgabor
2016-12-14 18:00
I can +2

hgabor
2016-12-14 18:00
@kostas

vukolic
2016-12-14 18:40
I will patch it

vukolic
2016-12-14 18:40
It's important that the second view is not smaller

vukolic
2016-12-14 18:40
Which it was

vukolic
2016-12-14 18:40
It can be higher

vukolic
2016-12-14 18:41
So != needs to be replaced by less

vukolic
2016-12-14 18:41
Will submit in an hour or so

vukolic
2016-12-14 18:41
Currently on my mobile

vukolic
2016-12-14 18:43
The test itself has this bug
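
The fix described above (the real test lives in `simplebft_test.go`; this is only a sketch of the property):

```go
package main

import "fmt"

// The failing assertion checked view equality, but view changes can skip
// ahead, so the correct property is monotonicity: the view must not go down.
func viewOK(got, expected uint64) bool {
	return got >= expected // was effectively: got == expected, too strict
}

func main() {
	// Replica 0 expected in view 1 but ended up in view 2:
	// legal under monotonicity, so the test should pass.
	fmt.Println(viewOK(2, 1)) // true
	// A view going backwards would still be flagged.
	fmt.Println(viewOK(0, 1)) // false
}
```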

vukolic
2016-12-14 19:01

vukolic
2016-12-14 19:01
This is probably a first HL fabric commit from a taxi in Madrid

kostas
2016-12-14 19:01
Excellent, thank you.

vukolic
2016-12-14 19:03
@nits7sid please refer to pbft paper (tocs 2002 version) as well as the SBFT vs PBFT diff described in https://jira.hyperledger.org/browse/FAB-378

baohua
2016-12-15 11:10
Hi, is there any discussion on migrating the orderer service outside of fabric code, into some separate one (fabric-orderer), like fabric-cop?

hgabor
2016-12-15 11:49
there could be :slightly_smiling_face:

baohua
2016-12-15 11:53
great, gabor, want to hear more feedback to see if we can make it separate soon.

garisingh
2016-12-15 11:58
@baohua - what exactly are you trying to accomplish? Even today you can run / use the ordering service without running any peer nodes. There are a few shared components under the fabric src tree, but you can definitely build and run it all by itself.

baohua
2016-12-15 12:00
Thanks, gari. As in the new arch, there are mainly several service roles: endorser/committer/consenters. I guess it would be somewhat natural to decouple an orderer role from a common peer. This may also bring the advantage of a pluggable ordering service implementation. What do you think?

garisingh
2016-12-15 12:22
Actually, even in the current architecture, orderers are entirely different executables / code base from the peer code. If you take a look at the "common" components under orderer, you'll see that the base "server" is entirely different from that of a peer. And the ordering service is actually pluggable: externally it implements the Broadcast and Deliver RPCs, so as long as your ordering service nodes implement those RPCs, you can plug them in. (There's a little more detail in how you need to construct the genesis block, but again that's common code which can be shared across different orderer implementations.) A peer node can either be an endorser or a committer (and endorsers really have to be committers as well for the most part). Ordering nodes are separate and need to implement the interfaces described above. Now, it is technically possible that a peer node could be an ordering node, but the current code is not implemented that way.

baohua
2016-12-15 12:47
Sure garisingh. I also notice that the code bases for peer and orderer are different, and that they don't share much code. So I think they will evolve into two individual components functionally. Thanks~

tuand
2016-12-15 15:01
scrum ...

2016-12-15 15:01
@tuand has started a Google+ Hangout for this channel. https://hangouts.google.com/hangouts/_/nvzhhad6h5b7lbrnybi4x6k3mye.

tuand
2016-12-15 15:02
@adc ^^^

adc
2016-12-15 15:06
oh yes

tom.appleyard
2016-12-15 17:50
Some quick questions about the Hyperledger v1 proposal:
When we say endorsers check code is 'deterministic' and 'stable', what does this mean?
Doesn't channelling transactions mean the consenters will have different ledgers? If not, how are ledgers exchanged between consenters? How are they kept in sync?
When we say consenters 'validate the integrity' of the transaction, what does this mean?
Why does having consenters increase transaction throughput? Was the previous throughput limit solely due to it needing to speak to all the nodes?
Do both endorsers and consenters have a copy of the ledger? Who stores it?
Can anyone help me out with these?

jyellick
2016-12-15 17:55
@tom.appleyard
> When we say endorsers check code is 'deterministic' and 'stable', what does this mean?
The endorsers execute the chaincode to produce a read/write set. If the code is non-deterministic and multiple endorsements are required, there will be different results, which will be detected, and the transaction will be rejected. The read/write sets are inherently deterministic.
> Doesn't channelling transactions mean the consenters will have different ledgers? If not, how are ledgers exchanged between consenters? How are they kept in sync?
The ordering network (what you call consenters here, I think) is necessarily a member of all channels for the ordering network. There is no technical reason which prevents multiple ordering networks for a set of peers, but this is not targeted for v1.
> When we say consenters 'validate the integrity' of the transaction, what does this mean?
The ordering network will make sure that the transactions are appropriately signed by an authorized transactor on the chain. After transactions have been ordered, the committing peers will do further checks based on MVCC and endorsement policies to further filter the transactions.
> Why does having consenters increase transaction throughput? Was the previous throughput limit solely due to it needing to speak to all the nodes?
This is not an accurate statement. The throughput limit before was actually much more closely tied to the fact that the consenters were executing transactions (via chaincode) rather than just ordering them. Removing the execution from the path resulted in transaction rates which were orders of magnitude faster.
> Do both endorsers and consenters have a copy of the ledger? Who stores it?
All peers retain a copy of the ledger for the chains they are participating in.

tom.appleyard
2016-12-15 18:13
@jyellick Thanks, that's shed some light - a few more questions if I may:
> All peers retain a copy of the ledger for the chains they are participating in
I assume this means there are now multiple ledgers, one for each channel they are connected to?
> The ordering network will make sure that the transactions are appropriately signed by an authorized transactor on the chain. After transactions have been ordered, the committing peers will do further checks based on MVCC and endorsement policies to further filter the transactions.
I take it a 'transactor' is the same as an endorser? Who does the ordering? How is it decided? Is it done by endorsers, consenters, or both?
Are they called 'consenters' or 'committers'? Here they are called 'consenters': https://github.com/hyperledger/fabric/blob/master/proposals/r1/Next-Consensus-Architecture-Proposal.md and here committers: https://hyperledger-fabric.readthedocs.io/en/latest/abstract_v1/
What is MVCC?
What is an endorsement policy exactly? I thought it was a list of endorsers (or a minimum number of them) that have to sign a transaction for it to be committed?

jyellick
2016-12-15 18:25
> I assume this means there are now multiple ledgers, one for each channel they are connected to?
This is a question of semantics to me. Internally, I believe this is a single 'ledger' which supports multiple chains. There's nothing that would prevent flipping it to multiple ledgers, which each support individual chains. @dave.enyeart can maybe help here.
> I take it a 'transactor' is the same as an endorser?
Actually, not at all! A transactor is someone who is authorized to submit transactions on a chain. An endorser is able to endorse transactions, but does not actually submit them. (Loosely:) the client/SDK creates a proposal, the proposal is sent to some set of endorsers for endorsement, who reply to the client with a proposal response. The client then assembles these proposal responses into a transaction, signs that transaction, and sends it to the ordering service (which validates this last outer signature). The transaction is ordered and makes its way to the committers, who then verify the endorsements etc.
> Are they called 'consenters' or 'committers'?
I personally like to avoid using the word 'consenter' because it means different things to different people. The ordering network will be backed by some form of consensus implementation, but this consensus is only on the ordering, not on the output of the transactions. The peers take the ordered transactions to create the validated ledger; these are usually referred to as 'committers'.
> What is MVCC?
MVCC is multi-version concurrency control: https://en.wikipedia.org/wiki/Multiversion_concurrency_control
> What is an endorsement policy exactly? I thought it was a list of endorsers (or a minimum number of them) that have to sign a transaction for it to be committed?
Close. The endorsement policy can be a little more powerful than this; for instance, it could require that an endorser from each of 3 different peer organizations signs a transaction (not just 3 endorsements required). @muralisr might have more for you here.
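
The "one endorser from each of 3 organizations" rule can be sketched as follows. Sketch only: Fabric expresses this with nested SignaturePolicy protos, not these illustrative types:

```go
package main

import "fmt"

// An endorsement, tagged with the organization of the endorsing peer.
type Endorsement struct {
	Org      string
	Endorser string
}

// onePerOrg checks that every listed org contributed at least one endorsement,
// which is stricter than just counting signatures.
func onePerOrg(orgs []string, endorsements []Endorsement) bool {
	seen := map[string]bool{}
	for _, e := range endorsements {
		seen[e.Org] = true
	}
	for _, org := range orgs {
		if !seen[org] {
			return false
		}
	}
	return true
}

func main() {
	orgs := []string{"OrgA", "OrgB", "OrgC"}

	// Three endorsements, but two come from the same org: rejected.
	bad := []Endorsement{{"OrgA", "p0"}, {"OrgA", "p1"}, {"OrgB", "p2"}}
	fmt.Println(onePerOrg(orgs, bad)) // false

	// One endorsement from each org: accepted.
	good := []Endorsement{{"OrgA", "p0"}, {"OrgB", "p2"}, {"OrgC", "p3"}}
	fmt.Println(onePerOrg(orgs, good)) // true
}
```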

jyellick
2016-12-15 18:25
^ @tom.appleyard

tom.appleyard
2016-12-15 18:35
@jyellick Ah brilliant - thanks again, a few more questions: >The consenters form the consensus service, i.e., a communication fabric that provides delivery guarantees... Peers are clients of the consensus service, to which the consensus service provides a shared communication channel offering a broadcast service for messages containing transactions. Are there many consensus services (i.e. one for each channel) or just one consensus service which provices all channels? >The ordering network will be backed by some form of consensus implementation, but this consensus is only on the ordering, not on the output of the transactions. What's the relationship between the consensus service and the ordering network? Who is part of the ordering network? >Internally, I believe this is a single 'ledger' which supports multiple chains I'm a bit confused then, what's the difference between a chain and a ledger? >A transactor is someone who is authorized to submit transactions on a chain Is that the same as a client submitter?

jyellick
2016-12-15 18:45
> Are there many consensus services (i.e. one for each channel) or just one consensus service which provides all channels?
Out of the gate, we'll simply have one ordering service for all chains. As I mentioned, there's no technical reason this has to be the case, but in the interest of walking before we run, we're sticking to one to start.
> What's the relationship between the consensus service and the ordering network? Who is part of the ordering network?
Most likely, 'consensus service' and 'ordering network' are referring to the same entity. We switched to using 'ordering' instead of 'consensus' because of the confusion the word 'consensus' was causing. The ordering network may be offered as a service and not involve any of the entities transacting on the chain, or it may be run by one or more of the transacting entities. Out of the gate, our target is CFT, which mostly lends itself to a single entity running the ordering network; however, in parallel we are working on bringing a PBFT-based implementation of the ordering network to add BFT, which will make a shared ordering network make more sense.
> I'm a bit confused then; what's the difference between a chain and a ledger?
My usage is that the chain is the sequence of blocks, and the ledger is the state associated with that sequence. But I think you'll find people mix these words around a lot.
> Is that the same as a client submitter?
Yes

tom.appleyard
2016-12-15 18:46
@jyellick ah brilliant, thanks again!

dave.enyeart
2016-12-15 19:29
People do use the term ledger for different things. I therefore try to avoid it and speak specifically about the 'chain' and the associated 'state database'. When we say 'ledger' we often mean both taken together. For each channel there will be a chain and an associated state database.

dave.enyeart
2016-12-15 19:33
Another thing to clear up from above is that committers will not have a validated ledger as defined in the Next Consensus Architecture document. Each committer will have a raw ledger, with the blocks having a non-hashed section indicating which of the transactions in the block were validated vs invalidated. This design serves much of the purpose of the validated ledger, keeps things simple by having the same blocks and same hashes on the ordering service and committing peers, and makes it easier for auditors to understand which transactions were validated vs invalidated.
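The raw-ledger design described above can be pictured with a small sketch. Everything here is hypothetical and simplified (the type and field names do not match the real Fabric protos): the hashed portion of the block is never touched, and validation only flips flags in unhashed metadata.

```go
package main

import "fmt"

// Block is a toy stand-in for a Fabric block: the transactions are covered
// by the block hash, while the validity flags live in a separate metadata
// section that is deliberately excluded from hashing, so orderers and
// committers agree on the same block hashes.
type Block struct {
	Transactions []string // payloads, covered by the block hash
	ValidFlags   []bool   // per-transaction bitmask, NOT covered by the hash
}

// validTxs returns only the transactions flagged valid by the committer.
func validTxs(b Block) []string {
	var out []string
	for i, tx := range b.Transactions {
		if b.ValidFlags[i] {
			out = append(out, tx)
		}
	}
	return out
}

func main() {
	b := Block{
		Transactions: []string{"tx1", "tx2-mvcc-conflict", "tx3"},
		ValidFlags:   []bool{true, false, true},
	}
	fmt.Println(validTxs(b)) // [tx1 tx3]
}
```

Pruning invalid transactions at a checkpoint would then amount to rewriting archived history with only the flagged-valid transactions of each block.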

dave.enyeart
2016-12-15 19:34
When checkpointing is introduced post-v1, we will likely prune the invalid transactions out as part of the checkpoint process, leaving us with a validated ledger.

dave.enyeart
2016-12-15 19:43
@jyellick FYI due to the above, we will likely rename RawLedger and ValidatedLedger in the code, to be OrdererLedger and CommitterLedger in FAB-1390

muralisr
2016-12-15 19:45
@dave.enyeart `When checkpointing is introduced post-v1, we will likely prune the invalid transactions out as part of the checkpoint process, leaving us with a validated ledger.` … any reason to do that ?

muralisr
2016-12-15 19:45
seems a valuable piece of info, no ?

dave.enyeart
2016-12-15 19:45
simply to save space, if somebody has spammed a lot of invalid trans

muralisr
2016-12-15 19:45
ok

muralisr
2016-12-15 19:46
we’d be throwing the baby out with the bath water IMO

muralisr
2016-12-15 19:46
:slightly_smiling_face:

dave.enyeart
2016-12-15 19:46
i expect there will be a config option to prune or not

baohua
2016-12-16 00:09
@dave.enyeart Is there any protection scheme now that detects too many invalid transactions from some entities (e.g., a DDoS) and takes some action? Thanks!

dave.enyeart
2016-12-16 02:36
I don’t know of any DDOS protections, let’s see if anybody else has thoughts...

kostas
2016-12-16 03:07
No DDoS protection built on the ordering side (yet at least) either.

hgabor
2016-12-16 07:52
if I use `main` instead of `sbft` as executable in the test, will it work? I mean, is the main executable compiled before the test? I guess it is https://gerrit.hyperledger.org/r/#/c/2515/

hgabor
2016-12-16 08:50
no it won't - it is not built

hgabor
2016-12-16 08:50
I need some suggestions how to do this

tom.appleyard
2016-12-16 14:13
@jyellick @dave.enyeart Had another read through the answers, couple more questions:
> For each channel there will be a chain and associated state database.
Just to check, am I correct in thinking there are multiple blockchains and multiple worldstates (one for each channel) on each peer? Do committers handle all transactions or only transactions for certain chaincode? How does a submitting client know which endorser to send its transaction proposal to?
> committers will not have a validated ledger
What is a validated ledger?
> When checkpointing is introduced post-v1
What is checkpointing? How does a submitting client know which endorser has the chaincode for its transaction? Why is sending the transaction proposals to all the endorsers the burden of the client instead of the client sending it to 1 endorser which spreads it around? Do the committers then execute the chaincode again? If not, do the endorsers send the results of their executions to the committers? If that's the case, what governs where each committer gets its results from?

vukolic
2016-12-16 14:14
@dave.enyeart we need to sync on the terminology changes to NCAP

vukolic
2016-12-16 14:14
for instance early next week

jyellick
2016-12-16 14:20
> Just to check, am I correct in thinking there are multiple blockchains and multiple worldstates (one for each channel) on each peer?
Correct. (One for each chain the peer is participating in)
> Do committers handle all transactions or only transactions for certain chaincode?
All transactions for each chain they are participating in
> How does a submitting client know which endorser to send its transaction proposal to?
This is managed by the application
> What is a validated ledger?
There was the idea that transactions would first be ordered into a 'raw chain', which would contain transactions that were properly formed and signed, but not necessarily valid (because of MVCC conflicts etc.). Then, the peers would essentially create a second chain from this 'raw chain'; this 'validated chain/ledger' would contain only the transactions that actually applied. In implementation, it was easier simply to provide a bitmask for which transactions in a block are valid rather than essentially store two copies of everything.
> What is checkpointing?
For long-running blockchains, it's infeasible to require that a new peer sync trillions of blocks, so instead, like with most dbs, there will need to be support for 'snapshotting' the world state, then archiving the old chain. This could be done once a year, or every 10 million blocks, or 100 GB, or whatever.
> Why is sending the transaction proposals to all the endorsers the burden of the client instead of the client sending it to 1 endorser which spreads it around?
This was considered, but ultimately abandoned, I believe, because the requirement to handle byzantine endorsers required clients to submit to multiple endorsers anyhow (or there may be other reasons I'm unaware of)
> Do the committers then execute the chaincode again? If not, do the endorsers send the results of their executions to the committers? If that's the case, what governs where each committer gets its results from?
The transaction contains the result of the chaincode execution, so they simply apply the results. The application of the results is guaranteed to be deterministic even if the chaincode is not. This is why you might have heard a clever line from @garisingh (and I may mis-state it): "You don't need deterministic code because we have deterministic transactions".
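The 'deterministic transactions' line can be made concrete with a toy model (hypothetical types; the real Fabric read-write set structures are richer). The committer never re-executes chaincode: it checks the endorsed read versions against current state and, if they still match, applies the writes verbatim; otherwise the transaction is marked invalid.

```go
package main

import "fmt"

// versioned is a value plus the version counter used for MVCC checks.
type versioned struct {
	value   string
	version int
}

type state map[string]versioned

// rwSet is a simplified endorsed result: versions read at simulation
// time, and the values to write if those reads are still current.
type rwSet struct {
	reads  map[string]int    // key -> version observed at endorsement
	writes map[string]string // key -> new value
}

// apply returns false (tx invalidated) on any read-version mismatch;
// on success it applies the writeset verbatim, which is deterministic
// regardless of how the chaincode produced it.
func apply(s state, tx rwSet) bool {
	for k, v := range tx.reads {
		if s[k].version != v {
			return false // MVCC conflict: mark invalid, state untouched
		}
	}
	for k, val := range tx.writes {
		s[k] = versioned{value: val, version: s[k].version + 1}
	}
	return true
}

func main() {
	s := state{"balance": {value: "100", version: 1}}
	ok := apply(s, rwSet{reads: map[string]int{"balance": 1},
		writes: map[string]string{"balance": "90"}})
	fmt.Println(ok, s["balance"].value) // true 90

	// A second tx endorsed against the now-stale version 1 is invalidated.
	stale := apply(s, rwSet{reads: map[string]int{"balance": 1},
		writes: map[string]string{"balance": "80"}})
	fmt.Println(stale, s["balance"].value) // false 90
}
```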

jyellick
2016-12-16 14:21
@tom.appleyard ^

dave.enyeart
2016-12-16 14:25
@vukolic Sure, I will be working part-time next week. I believe you know about our bitmask for invalid trans in the BlockMetadata, and our intent to add checkpoint post-v1. Do you want to draft up the terminology changes and I will review?

vukolic
2016-12-16 14:25
there is that

vukolic
2016-12-16 14:26
there is consenters --> orderers

vukolic
2016-12-16 14:26
there is the stuff with writeset and readset

vukolic
2016-12-16 14:26
am I missing sth big?

vukolic
2016-12-16 14:28
batches --> blocks

vukolic
2016-12-16 14:28
blocks --> Vblocks (and post v1)

dave.enyeart
2016-12-16 14:29
@vukolic, @nickgaski wrote a good glossary of v1 terms to review, I’ll add my comments to that and send to both of you.

nickgaski
2016-12-16 14:29
has joined #fabric-consensus-dev

vukolic
2016-12-16 14:30
ok pls send me so we can sync all this stuff

vukolic
2016-12-16 14:30
we can add glossary to NCAP

vukolic
2016-12-16 14:30
this is a good idea

tom.appleyard
2016-12-16 15:00
@jyellick thanks again! A few more questions:
> This is managed by the application
How does the application know where to send them?
> The transaction contains the result of the chaincode execution, so they simply apply the results.
Where do these results come from? I've heard that the way the endorsers check if the chaincode has the same outcome is that one runs it first, sends it to the client, and the client then sends the transaction proposal with this result to everyone else - is this correct?
> The application of the results is guaranteed to be deterministic even if the chaincode is not.
Surely the transaction would be rejected if the chaincode it calls doesn't behave in a deterministic manner? I was under the impression endorsers check if chaincode is 'deterministic' and 'stable'

jyellick
2016-12-16 15:10
> How does the application know where to send them?
The application manages which peers are participating in which chains, so it already has this information.
> Where do these results come from? I've heard that the way the endorsers check if the chaincode has the same outcome is that one runs it first, sends it to the client, and the client then sends the transaction proposal with this result to everyone else - is this correct?
@muralisr can be more precise here, but that is my understanding. The results are the readset, writeset, and postimage of the database query. @dave.enyeart may be able to be more specific.
> Surely the transaction would be rejected if the chaincode it calls doesn't behave in a deterministic manner? I was under the impression endorsers check if chaincode is 'deterministic' and 'stable'
The endorsement process ensures that if execution produces different results across different endorsers, a valid transaction cannot be formed.
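A sketch of that last point, with made-up helper names: since each endorsement is bound to a digest of the simulation results, a client facing divergent results cannot gather the required number of matching endorsements, so no valid transaction can be assembled.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// resultDigest stands in for the hash each endorser signs over its
// simulation results (readset/writeset).
func resultDigest(result string) [32]byte { return sha256.Sum256([]byte(result)) }

// canFormTx reports whether at least `required` endorsers produced
// byte-identical results, i.e. whether a valid transaction can be built.
func canFormTx(results []string, required int) bool {
	counts := map[[32]byte]int{}
	for _, r := range results {
		d := resultDigest(r)
		counts[d]++
		if counts[d] >= required {
			return true
		}
	}
	return false
}

func main() {
	// Deterministic chaincode: all three endorsers computed the same writeset.
	fmt.Println(canFormTx([]string{"w=5", "w=5", "w=5"}, 2)) // true
	// Non-deterministic chaincode (e.g. one that read a clock): results diverge.
	fmt.Println(canFormTx([]string{"w=5", "w=7", "w=9"}, 2)) // false
}
```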

tom.appleyard
2016-12-16 15:11
quick follow up question about endorsing logic - who sets this? Does it come with the chaincode when it's deployed to a peer or with the transactions from the submitter?

jyellick
2016-12-16 15:11
Chaincode deployment includes the endorsement policy

tom.appleyard
2016-12-16 15:12
> The application manages which peers are participating in which chains
I'm a bit confused - when you say 'the application', what does this refer to exactly?

jyellick
2016-12-16 15:14
"The application" is the useful thing which leverages the fabric, usually built on top of the SDK

tom.appleyard
2016-12-16 15:15
so you mean the chaincode?

tom.appleyard
2016-12-16 15:16
and I suppose the associated `.yaml` files to set things up

jyellick
2016-12-16 15:16
No, the chaincode is a piece of the application, but think the thing which knows "Website click routes to this binding which invokes X chaincode on Y endorsers"

tom.appleyard
2016-12-16 15:18
I'm still a bit confused - so let's say I have my client machine with some client program running on it and I tell this program to send a transaction to the network. How does it know which endorsers to send the transaction proposal to? (are some hardcoded? if so how does it find the others)

muralisr
2016-12-16 15:59
@tom.appleyard the application knows what the tx is all about (transfer $x from bank b to bank c) and that determines the context for endorsement

muralisr
2016-12-16 16:01
in other words, separating the endorsement out as a “pre-consensus” (pre-ordering) step makes it closer to the application / business logic layer, where the application/SDK needs to know the actors of endorsement

bcbrock
2016-12-16 20:47
@kostas @jyellick My results of comparing different approaches to queuing on the broadcast side in the Kafka orderer are remarkably uninteresting. Here's my interpretation of why: these clients are able to generate and broadcast transactions much faster than they can be consumed by Kafka, and the clients are not considered complete until they have received an ACK for all transactions they sent. Regardless of how work is split between threads, in the end the clients are waiting for Kafka. There may be a small advantage to doing the work in multiple threads, but there is no evidence of any consistent benefit - the results look more or less random. Some kind of work queuing might provide a small latency benefit for "bursty" clients, but in these throughput-oriented runs the clients quickly stuff the queues and then work simply proceeds at a rate based on Kafka. In the future this could possibly change - if the overhead of signature checking is high, for example, then it might be advantageous to split front-end and back-end work into separate threads. But for now there is no evidence that this is necessary.


kostas
2016-12-16 20:51
The normalized graphs are telling.

jyellick
2016-12-16 20:52
Thanks for the testing, I'll definitely be curious about the changes once signature validation comes online

kostas
2016-12-16 20:52
Till then I suggest we stick with the first changeset of the series.

bcbrock
2016-12-16 20:54
The one that removes all Queuing? I agree it is simplest for now, easy to add queueing back in the future for testing.

kostas
2016-12-16 20:54
Yes.

haifeng
2016-12-17 14:49
has joined #fabric-consensus-dev

vukolic
2016-12-17 19:12
@bcbrock - interesting numbers: my observation is that it seems we do not reach saturation with 32 clients in any of the experiments

vukolic
2016-12-17 19:13
to have an idea of the peak throughput we should be saturating the system

vukolic
2016-12-17 19:13
also - adding latency numbers would complement the experiment nicely

crazybit
2016-12-19 05:57
have a question on the sbft protocol: why does a replica need to receive n-f-1 prepare msgs before submitting a commit msg

crazybit
2016-12-19 05:58
not f+1 prepare msg ?

garisingh
2016-12-19 09:27
@vukolic - ^^^^^

cca
2016-12-19 09:39
@crazybit - please see the Castro-Liskov paper or any good textbook (such as http://www.distributedprogramming.net), using only f+1 would be wrong
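For intuition on the numbers in question, here is the standard quorum arithmetic for n = 3f+1 replicas (a summary of the textbook/Castro-Liskov argument, not anything sbft-specific): a "prepared" certificate is the pre-prepare plus n-f-1 = 2f matching prepares, i.e. agreement among 2f+1 replicas, and any two such sets of 2f+1 out of 3f+1 intersect in at least f+1 replicas, at least one of which is correct. With only f+1 prepares the intersection could contain no correct replica, which is unsafe.

```go
package main

import "fmt"

// quorums computes, for n = 3f+1 replicas, the prepare-quorum sizes and the
// guaranteed number of correct replicas in the intersection of any two
// prepared certificates.
func quorums(f int) (n, prepares, certSize, correctInOverlap int) {
	n = 3*f + 1
	prepares = n - f - 1           // the 2f prepare messages asked about
	certSize = prepares + 1        // plus the pre-prepare: 2f+1 replicas
	overlap := 2*certSize - n      // minimum intersection of two certificates
	correctInOverlap = overlap - f // worst case: all f faulty replicas in it
	return
}

func main() {
	for f := 1; f <= 3; f++ {
		n, p, c, oc := quorums(f)
		fmt.Printf("f=%d: n=%d prepares=%d certificate=%d correct-in-overlap>=%d\n",
			f, n, p, c, oc)
	}
}
```

Since `correctInOverlap` is always at least 1, two conflicting requests can never both gather a certificate for the same sequence number, which is exactly the property a quorum of only f+1 would fail to give.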

vukolic
2016-12-19 09:42
@crazybit - yes this is not sth that is explained in slack - pls see pointers by @cca

vukolic
2016-12-19 09:42
that said,

vukolic
2016-12-19 09:43
see also xft paper for when you could actually do with f+1 replicas in the loop



vukolic
2016-12-19 10:19
https://jira.hyperledger.org/browse/FAB-474 seems: 1) moved to common (which is good) and 2) backlogged - does that mean we are not having it for v1?

vukolic
2016-12-19 10:20

vukolic
2016-12-19 10:20
and maybe a few more

vukolic
2016-12-19 10:21
@kostas @jyellick ^^

zws
2016-12-19 12:21
has joined #fabric-consensus-dev

tzukru
2016-12-19 12:56
has joined #fabric-consensus-dev


vukolic
2016-12-19 14:56
@jyellick ^^ since you previously +2ed and we only need that

vukolic
2016-12-19 14:56
to get this 5-weeks old changeset in

bcbrock
2016-12-19 15:29
@vukolic The numbers in the spreadsheet from Friday do show one instance of a kind of saturation: For 32 broadcast/deliver clients with 2K blobs, the deliver side can not keep up with broadcast, and takes about 10% longer to finish delivery than to finish broadcast in this example. In order to find the real maximum throughput I would need to upgrade my system; With 64 clients @ 2KB blobs each we saturate the 10Gb network interface of the server with all of the “deliver” traffic. The benchmark code currently measures latency but does not report it yet in a user-friendly way.

vukolic
2016-12-19 15:57
thanks - so you'd expect the 256-byte experiment to saturate at (very roughly) 8x the throughput of the 2K experiment?

vukolic
2016-12-19 15:57
(which would mean 80k tps)

bcbrock
2016-12-19 16:01
Not sure; The throughput seems to be more a function of the # of clients. I can try that later today.

hgabor
2016-12-19 16:07
we would need one more +2 here: https://gerrit.hyperledger.org/r/#/c/2515/

hgabor
2016-12-19 16:07
:smile: sorry for the repost

jyellick
2016-12-19 16:24
@hgabor 2515 is in

jyellick
2016-12-19 16:25
With respect to Deliver not scaling, please keep in mind that Deliver is backed by a toy ledger right now, something used to show correctness, but was never meant to scale. Sprint 9 targets pulling in the 'real' ledger which has been developed in parallel.

jyellick
2016-12-19 16:30
Also note that the Deliver API is simplified in https://gerrit.hyperledger.org/r/#/c/3271/ which may yield performance improvements as well

kostas
2016-12-19 17:44
> and takes about 10% longer to finish delivery than to finish broadcast in this example

kostas
2016-12-19 17:44
@bcbrock: How do you define "finish" in both cases?

bcbrock
2016-12-19 17:45
For broadcast, finish means all ACKs received. For deliver, all TX delivered. (These tests broadcast/deliver a (large) fixed number of TX)

bcbrock
2016-12-19 17:47
I should also say that I have not tuned Kafka/Java, so it may be that with some work we could get broadcast/deliver rates to match

kostas
2016-12-19 17:48
Thank you. So a naive question possibly:

kostas
2016-12-19 17:48
> For 32 broadcast/deliver clients with 2K blobs, the deliver side can not keep up with broadcast, and takes about 10% longer to finish delivery than to finish broadcast in this example.

kostas
2016-12-19 17:48
This does not necessarily imply a saturation though, does it?

bcbrock
2016-12-19 17:49
How are you defining saturation?

kostas
2016-12-19 17:49
Or rather, what kind of saturation do you talk about here?

kostas
2016-12-19 17:49
Ah, you beat me to it.

bcbrock
2016-12-19 17:49
I meant saturation as delivery can not keep up with broadcast

bcbrock
2016-12-19 17:50
I think @vukolic may have meant that we hadn’t seen throughput roll-over, so we don’t know what the max. really is

kostas
2016-12-19 17:50
Got it, thank you.


kostas
2016-12-19 17:52
@vukolic: Anything that is not in for sprints 8 or 9 goes into backlog.

vukolic
2016-12-19 19:59
Whats the def of end of sprint 9?

kostas
2016-12-19 20:37
This week and the next week are sprint 8.

kostas
2016-12-19 20:38
The two weeks that follow are sprint 9, etc.

kostas
2016-12-19 20:55
So: Jan 15

vukolic
2016-12-19 21:08
that's too complicated

kostas
2016-12-19 21:20
don't kill the messenger


vukolic
2016-12-19 22:01
hopefully brings arch document closer to real naming

crazybit
2016-12-20 01:31
@cca @vukolic thanks

crazybit
2016-12-20 03:31
one more question: it looks like the sbft test cases all stop before reaching the checkpoint step. Is that expected, or is the checkpoint implementation not complete, so this step is skipped at this stage?

kostas
2016-12-20 05:41
@hgabor: I'm running the unit tests on the entire `orderer` package, and the `sbft` ones seem to run for well over a minute.


kostas
2016-12-20 05:42
When you find some time, could you please add the `Short()` check to them to make testing for everything else a bit faster?

kostas
2016-12-20 05:42
(Don't worry about the failure on the screenshot, I Ctrl+C'd after realizing it was taking longer than usual.)

vukolic
2016-12-20 07:44
@crazybit which test cases? do you have a debug log of these runs?

vukolic
2016-12-20 07:44
normally all nodes run through checkpoint phase

hgabor
2016-12-20 08:15
@kostas for example now :slightly_smiling_face:

hgabor
2016-12-20 08:20
I will add something like this: if short: skip

hgabor
2016-12-20 08:20
if I am right we also have to add the short option to the test run

hgabor
2016-12-20 08:20
e.g. go test -short (or the proper form)


hgabor
2016-12-20 09:15

hgabor
2016-12-20 09:33
I will also have to modify the way tests are started

ruslan
2016-12-20 10:01
has joined #fabric-consensus-dev


kostas
2016-12-20 12:18
Correct. (And thanks!)

hgabor
2016-12-20 13:50
@garisingh others love this: https://gerrit.hyperledger.org/r/#/c/3419/3 you can +2 it :smile:

garisingh
2016-12-20 13:51
hehe - done :wink:

jonathanlevi
2016-12-20 13:51
For the love of shorter testing cycles! Merged.

hgabor
2016-12-20 13:56
@jonathanlevi note that for that we also need this: https://gerrit.hyperledger.org/r/#/c/3421/2 without this it is only a half armed giant

jonathanlevi
2016-12-20 13:57
OK. Let’s wait for the build to complete.

jonathanlevi
2016-12-20 13:58
+1 on “negotiation skills” :wink:

tom.appleyard
2016-12-20 16:32
@jyellick @muralisr I'm still a bit lost as to how a submitting client knows which endorsers to send a transaction proposal to - you said 'the application' knows, but I'm not clear what this refers to?

jyellick
2016-12-20 16:41
@tom.appleyard You can think that 'the application' instructs peers to join a chain, and also deploys chaincode. So, it makes sense that 'the application' knows which peers are capable of endorsing a proposal, and which endorsements are required (because again, 'the application' was the one who initially decided what these requirements were)

tom.appleyard
2016-12-20 17:00
I'm still confused :confused: Perhaps it might help if I explain what I understand about the current way of doing development:
(on the subject of what 'the application' means) From what I understand, when you develop using hyperledger as your platform, what you are doing is making a collection of chaincode files which contain go (at least for now) code that reads and writes to a 'world state'. You can control who is able to make calls to this go code by defining users in membersrvc.yaml and giving them groups and therefore rights. This and core.yaml are also used to change how the peers and member service behave (such as whether attributes are in tcerts, the consensus alg to use, etc.). Finally you can also make front-end apps with node.js and the hfc library which allow you to interact with the peers, deploying chaincode and sending transactions. Using an hfc app requires you to log in with one of the users you defined in membersrvc.yaml. We can also use hfc to register new users and change the rights and privileges of existing users. Obviously the above is based on knowledge of v0.6, but in the above context what is 'the application' - is it the chaincode? the yaml? the hfc app? or all of these together?
(on the subject of how submitters know where to send proposals) Now bringing this into v1.0, I presume it will work in more or less the same way (i.e. we have chaincode pushed to endorsers whose behaviour is controlled by these yaml files). If I could just clarify - what is a submitter in this setup? Would the hfc app mentioned previously be the submitter - the name 'client submitter' simply referring to the fact that it is not a peer, rather than meaning it runs on a user's machine? Now regarding where to send transaction proposals, how would this submitter know what the addresses of the endorsers are? If the submitter is an hfc node app running on a server somewhere, does it have the IP addresses hardcoded? How does it know if endorsers are taken down/crash/are added? (apologies if I've misunderstood some of the key bits of how this works)

tom.appleyard
2016-12-20 17:00
@jyellick

jyellick
2016-12-20 18:42
The big point of divergence between 0.6 and the new architecture is that there is no longer a single point of authority in the form of membership services. You can think for instance that you have three organizations (A, B, C) participating in the blockchain, none of which has sole authority to do anything. Each organization will have some sort of administrator who deploys peers and instructs them to participate in a given chain. Obviously organization A's admin can't provision peers for organization B, and organization B can't instruct organization A's peers to participate in a particular chain.
Similarly, if you wish to do something like deploy a chaincode, because this code will execute on every endorsing peer participating on the chain, usually it's not sufficient for a single organization to endorse the deployment of a chaincode. So, when a chain is created (or reconfigured) the participating orgs need to agree on policies for chaincode deployment. That policy might be that every organization needs to agree, or it might be that only one special 'dictator' needs to agree (or some other more complicated scheme), but this policy governs what endorsements are required for chaincode deployment. Because this consortium created that policy, it inherently knows what the endorsement requirements are.
When it comes to actually finding endorsing peers, the expectation is that when an organization agrees to participate in a chain, it will provision some peers and join them to the chain. However, rather than enforce that the network tracks a full list of all peers, those peers which the org admin wants to designate as endorsers can be reported (via whatever mechanism is appropriate) to the consortium, so that any of these peers may be targeted when a chaincode requires endorsement from that particular organization. It is possible to build some layer on top of the fabric to track which peers are available (in fact, the gossip piece does this to some extent), but there is no requirement to do so.
Note that I haven't really discussed users etc., because this is all 'chain management'. The individual user rights within an organization for a particular chaincode are all still managed as you indicated (to the best of my knowledge; that is leaving my domain of expertise). The 'client submitter' is generally the hfc app you described; I would not typically expect this code to execute on a user's machine. Naively and out of the gate, I would expect that yes, the hfc app developer would simply require a list of orgs and their endorsers, and it would randomly pick among them as needed (switching on failure etc.). For a more robust deployment, I would actually see the peer lists maintained via a chaincode or other API so that the manual distribution of machine names could be eliminated, but it's certainly not a prerequisite to having a working application.

jyellick
2016-12-20 18:42
@tom.appleyard ^

vukolic
2016-12-20 21:50

vukolic
2016-12-21 09:33
this one would profit from no-review-delay https://gerrit.hyperledger.org/r/#/c/3457/

tom.appleyard
2016-12-21 12:28
@jyellick ok I see, that shines some light, a few follow up questions:
> there is no longer a single point of authority in the form of membership services.
> individual user rights within an organization for a particular chaincode are all still managed as you indicated
Does this mean everyone has their own CA? If there's no central authority, can any CA verify any transactions (i.e. one organisation's CA verifying signatures of another)? Who should I ask about this?
> Each organization will have some sort of administrator who deploys peers, and instructs them to participate in a given chain.
Just to check, when you say 'instructs them to participate in a given chain' what this means is that it subscribes the endorsers to particular channels on the consensus service (in order to get transactions for said chain)?
> So, when a chain is created (or reconfigured) the participating orgs need to agree on policies for chaincode deployment...
> ...Because this consortium created that policy, it inherently knows what the endorsement requirements are.
From what I understand about how the transactions now work, the process is like this:
1. Client Submitter submits transaction proposal to endorsers
2. Endorsers send yes/no responses back to Client Submitter
3. Client Submitter (if the endorsement policy is met) sends transaction to committers or (if it is not) discards the transaction.
4. Committers verify that the endorsers who endorsed the transaction actually did
5. Committers update the relevant ledgers with the outcome of the transaction
As such, how does a client submitter know what the endorsement policy of a particular transaction would be? Are client submitters owned by organisations? Does sending transactions to the committers work in the same way as sending them to endorsers - the client submitter is responsible for knowing which ones to send it to and sending it to them?
> endorsers can be reported (via whatever mechanism is appropriate) to the consortium
Would I be correct then in thinking that 'where' and 'who' the endorsers are is just tracked by some system - it doesn't really matter what (and indeed you can choose not to do it)? The point is that this system isn't part of Hyperledger Fabric; it would just work with it. Expanding this a bit, would I be correct in thinking that when you add new organisations to the network this 'system' would be informed of new peers joining and as such would update the others, telling them who and where the new ones are as well as what role they perform?

yacovm
2016-12-21 12:48
Hey, anyone home?

weeds
2016-12-21 13:27
@garisingh can you help with some of these questions toda?

weeds
2016-12-21 13:27
(lot of people are out for the holidays now)

vukolic
2016-12-21 13:54
sbft quorum size optimization here https://gerrit.hyperledger.org/r/#/c/3459/

tom.appleyard
2016-12-21 14:28
Also, @jyellick regarding the 'ordering service' you mentioned, I'm not finding any references to it here: https://github.com/hyperledger/fabric/blob/master/proposals/r1/Next-Consensus-Architecture-Proposal.md Am I looking at an out of date document?

jyellick
2016-12-21 14:45
@tom.appleyard
> Does this mean everyone has their own CA?
In general, every organization has its own CA. It's not a requirement, but in the interest of decentralized authority, this is a feature.
> If there's no central authority can any CA verify any transactions (i.e. one organisation's CA verifying signatures of another)? Who should I ask about this?
No, the chain configuration contains a list of all the MSP definitions (membership service providers) for the chain. This means that anyone with a copy of the chain configuration (which is itself a subset of 'the chain') can verify the signatures of any of the transactors on the chain. The chain config also embeds policies such as "All three of orgs A, B, and C must agree before adding a new MSP to the chain".
> Just to check, when you say 'instructs them to participate in a given chain' what this means is that it subscribes the endorsers to particular channels on the consensus service (in order to get transactions for said chain)?
There is a `JoinChain` RPC that the peer supports which takes a genesis block for a chain and causes that peer to then retrieve a copy of that chain, and to process updates and endorsements, etc. It doesn't explicitly require contact with the ordering service (it could also catch up and get updates through gossip) but it might.
> 1. Client Submitter submits transaction proposal to endorsers
> 2. Endorsers send yes/no responses back to Client Submitter
> 3. Client Submitter (if the endorsement policy is met) sends transaction to committers or (if it is not) discards the transaction.
> 4. Committers verify that the endorsers who endorsed the transaction actually did
> 5. Committers update the relevant ledgers with the outcome of the transaction
Broad strokes, correct, but in 3, the client sends the transaction to the ordering service for ordering, then eventually the committer gets a batch (which is really just a block which potentially contains some invalid transactions), then 4 and 5 happen.
> As such, how does a client submitter know what the endorsement policy of a particular transaction would be? Are client submitters owned by organisations?
The 'client' which builds the transaction is generally 'the application'. The application deployed the chaincode, so the application knows the endorsement requirements, so the client knows these requirements.
> Does sending transactions to the committers work in the same way as sending them to endorsers - the client submitter is responsible for knowing which ones to send it to and sending it to them?
As I mentioned, the client never sends the transaction directly to the committers; the client sends the transaction to ordering, and the committers eventually receive the ordered transaction (assuming it was well formed).
> Would I be correct then in thinking that 'where' and 'who' the endorsers are is just tracked by some system - it doesn't really matter what (and indeed you can choose not to do it)? The point is that this system isn't part of Hyperledger Fabric, it would just work with it.
Correct. Eventually there may be a standard or recommended way to do this, but nothing like that is targeted for v1.
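Step 4 of the flow above, where committers verify the endorsements, amounts to evaluating the chaincode's endorsement policy against the signatures attached to the transaction. A deliberately simplified, hypothetical 'k distinct orgs out of this set' policy check might look like:

```go
package main

import "fmt"

// policy is a toy endorsement policy: at least k of the listed orgs must
// have endorsed. Real Fabric policies are richer (nested AND/OR trees) and
// the committer also verifies the signatures themselves; here we assume
// signature verification already happened and only count endorsing orgs.
type policy struct {
	orgs map[string]bool // orgs whose endorsement counts
	k    int             // how many distinct such orgs are required
}

// satisfied reports whether the set of endorsing orgs meets the policy.
func satisfied(p policy, endorsingOrgs []string) bool {
	seen := map[string]bool{}
	for _, org := range endorsingOrgs {
		if p.orgs[org] {
			seen[org] = true // duplicates from the same org count once
		}
	}
	return len(seen) >= p.k
}

func main() {
	p := policy{orgs: map[string]bool{"OrgA": true, "OrgB": true, "OrgC": true}, k: 2}
	fmt.Println(satisfied(p, []string{"OrgA", "OrgC"}))    // true
	fmt.Println(satisfied(p, []string{"OrgA", "OrgA"}))    // false: same org twice
	fmt.Println(satisfied(p, []string{"OrgA", "Mallory"})) // false: unknown org
}
```

Transactions failing this check are the ones that get flagged invalid in the block metadata rather than being dropped from the chain.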

jyellick
2016-12-21 14:45
> Expanding this a bit, would I be correct in thinking that when you add new organisations to the network this 'system' would be informed of new peers joining and as such would update the others telling them who and where the new ones are as well as what role they perform?
It's important never to conflate "peer" (the process) and "peer organization" (one of the entities participating in the blockchain). When a peer organization joins, this requires updating the chain configuration, which is propagated through the chain, and everyone knows. Joining an individual peer has no such requirement.
> regarding the 'ordering service' you mentioned, I'm not finding any references to it here:
"Ordering service" is synonymous with "consensus service", but because of differing interpretations of the word "consensus" we decided to clarify more explicitly what the service was providing with a new term that did not carry any baggage.

jyellick
2016-12-21 14:46
> Am I looking at an out of date document?
Conceptually it is still mostly correct, but the terminology and some details have changed; we need to bring this document up to speed.

kostas
2016-12-21 14:46
(There is a changeset out there by @vukolic that brings it up to date BTW.)

garisingh
2016-12-21 15:06
we should probably merge it and then I can update with my mostly grammatical / syntactical edits.

garisingh
2016-12-21 15:08
we also need to get rid of the version in the v0.6 branch

tom.appleyard
2016-12-21 15:08
@jyellick
> the client sends the transaction to the ordering service for ordering
Would I be able to have some more details on this step then? Which nodes are involved in the ordering service, and who is in charge of running them (as in, which organisation on the network)? How do they agree on the order?
> The 'client' which builds the transaction is generally 'the application'. The application deployed the chaincode, so the application knows the endorsement requirements, so the client knows these requirements.
When we say 'client', I take it we are talking about an hfc node app? With this in mind, what would happen if you wanted another instance of this hfc app (say for load balancing) which would not have deployed the chaincode; how would it know the endorsement policy (or would it be supplied with it at setup for all chaincodes it can handle)?
> membership service providers
What is an MSP vs. a membersrvc from v0.6?
> anyone with a copy of the chain configuration (which is itself a subset of 'the chain') can verify the signatures of any of the transactors on the chain
Because they would know who to contact about verifying someone from a specific organisation?
@kostas where would I be able to find this changeset?


tom.appleyard
2016-12-21 15:08
thanks!

vukolic
2016-12-21 15:08
we need just one more +2 to merge that

vukolic
2016-12-21 15:08
and make it more readable

vukolic
2016-12-21 15:09
(by the virtue of merging and mirroring to github)

kostas
2016-12-21 15:26
@hgabor: The `sbft` package leaves behind a `main` binary (in `sbft/main/`) _every time_ the unit tests are run. (This is most likely related to the way the files are structured inside your `main` folder, where there are `xxx_test` files w/o the associated `xxx` files.) Can you look into it when you get a chance?

hgabor
2016-12-21 15:36
@kostas that main is the "main app"/"executable" that starts an sbft peer. One of the tests creates it by calling go build. I can add a line to remove it, is that OK?

kostas
2016-12-21 15:46
I haven't looked at your tests' source code to know what they do exactly. There might be a better way to take care of the issue, but I won't know until I see the code. In general though, the tests shouldn't leave any artifacts behind.

hgabor
2016-12-21 15:48
I think the best and easiest way of solving this is removing that artifact.

jyellick
2016-12-21 15:50
> Would I be able to have some more details on this step then? Which nodes are involved in the ordering service, who is in charge of running them (as in which organisation on the network)? How do they agree on the order?
Aha! Finally a question which is native to #fabric-consensus-dev. The short answer is 'it depends'. The ordering service has a very simple API surface with two methods, `Broadcast` and `Deliver`. Transactions are injected through `Broadcast`, and the ordered batches/blocks are retrieved by calling `Deliver`. As for the backing consensus implementation and who runs this service, there are multiple options. For v1 the first-class citizen is a Kafka shim, which leverages the CFT nature and high throughput of Kafka. We are also working on a pbft-based solution (`sbft`) which provides BFT, but it will be more on the 'experimental' side for v1.
> What is an MSP vs. a membersrvc from v0.6?
This is a great question, because I personally find the terminology confusing. The membersrvc in 0.6 provided T-Certs and other network services for getting enrollment certs, etc. In v1 the MSPs simply provide a crypto implementation, such as, say, X.509 with a CA. The idea is to make the crypto scheme pluggable, but @elli @adc @aso might be better able to explain this.
> Because they would know who to contact about verifying someone from a specific organisation?
Because the chain configuration embeds all the material needed to instantiate the MSPs of all participating organizations, and thus to verify their signatures.
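
The two-method surface described here can be sketched in Go. Everything below (`Envelope`, `Block`, the solo-style batching rule) is an illustrative stand-in, not the real fabric protos or orderer code:

```go
package main

import "fmt"

// Illustrative stand-ins for the real protobuf messages.
type Envelope struct{ Payload []byte }

type Block struct {
	Number uint64
	Txs    []Envelope
}

// soloOrderer is a toy, single-process stand-in for the ordering
// service: Broadcast injects transactions, and a block is cut every
// batchSize transactions, preserving arrival order.
type soloOrderer struct {
	batchSize int
	pending   []Envelope
	blocks    []Block
}

func (o *soloOrderer) Broadcast(tx Envelope) error {
	o.pending = append(o.pending, tx)
	if len(o.pending) >= o.batchSize {
		o.blocks = append(o.blocks, Block{Number: uint64(len(o.blocks)), Txs: o.pending})
		o.pending = nil
	}
	return nil
}

// Deliver returns every block cut at or after startBlock.
func (o *soloOrderer) Deliver(startBlock uint64) []Block {
	if startBlock >= uint64(len(o.blocks)) {
		return nil
	}
	return o.blocks[startBlock:]
}

func main() {
	o := &soloOrderer{batchSize: 2}
	for _, p := range []string{"tx1", "tx2", "tx3"} {
		o.Broadcast(Envelope{Payload: []byte(p)})
	}
	// tx1 and tx2 were cut into block 0; tx3 is still pending.
	fmt.Println("blocks cut:", len(o.Deliver(0)))
}
```

The point of the narrow surface is exactly what gets discussed later in the channel: Kafka, solo, or sbft can all sit behind the same two calls.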

jyellick
2016-12-21 15:51
^ @tom.appleyard

vukolic
2016-12-21 16:07
hm - i dislike that label 'experimental' for bft :wink:

jyellick
2016-12-21 16:33
Everyone heard it! @vukolic is going to have `sbft` production ready and bullet proof by March

joshhus
2016-12-21 16:45
Please flag @joshhus when content on here is good for external v1.0 doc. Hard to keep track of all slack channels, thanks!

vukolic
2016-12-21 16:47
@jyellick "We are also working on a pbft based solution (`sbft`) which provides BFT." would suffice

yuki.k
2016-12-21 16:58
has joined #fabric-consensus-dev

srirama_sharma
2016-12-21 17:04
@jyellick, we are seeing unit-tests failure after below commit was done. ```commit cf03f20e0ed7ba5fb5f0afc6be1bf8cfe260d187 Merge: a401b28 bfd7c84 Author: Jason Yellick <jyellick@us.ibm.com> Date: Mon Dec 19 16:18:27 2016 +0000 Merge "Add networked stress tests for SBFT" ```

srirama_sharma
2016-12-21 17:04
The failing unit-test case is ```unit-tests_1 | FAIL unit-tests_1 | coverage: 49.2% of statements unit-tests_1 | FAIL http://github.com/hyperledger/fabric/orderer/sbft/main 834.346s ```

srirama_sharma
2016-12-21 17:05
Any thoughts on what could be architecture-dependent in this? I see that these unit tests aren't failing on x86_64 and Z

srirama_sharma
2016-12-21 17:09
It could also be some timing issue as I see grpc time out messages in the log

vukolic
2016-12-21 17:14
can you pls send me the full log on dm

vukolic
2016-12-21 17:34
@hgabor sharing the log with you on DM

vukolic
2016-12-21 17:34
seems grpc connections are not established at all

jyellick
2016-12-21 18:07
@srirama_sharma sounds like @vukolic and @hgabor are helping you. My name is on that commit because I clicked the merge button, but it is @hgabor's code so he is likely to be more useful than I

hgabor
2016-12-21 18:26
wasn't there a Z specific problem with grpc? I am not sure

vukolic
2016-12-21 18:42
this is power

mohamoudegal
2016-12-21 18:47
has joined #fabric-consensus-dev

mohamoudegal
2016-12-21 18:48
Hi everyone,

mohamoudegal
2016-12-21 18:48
I’m new to the community and I had a bug I ran into with one of the Fabric tutorials

mohamoudegal
2016-12-21 18:49
@mohamoudegal uploaded a file: https://hyperledgerproject.slack.com/files/mohamoudegal/F3JBW031U/screen_shot_2016-12-21_at_11.45.24_am.png and commented: I inputted the right info from the credentials file on Bluemix, but I’m getting a syntax error message. Please advise.

mohamoudegal
2016-12-21 18:50
Also does anyone know why the chaincode.go & chaincode_finished.go are completely different?

kostas
2016-12-21 19:32
@mohamoudegal #fabric-dev is probably a better place for these questions

mohamoudegal
2016-12-21 19:33
@kostas thanks

xixuejia
2016-12-22 00:21
Hi all. As for v1.0, can I say there will be no dependency among the transactions inside the same block? E.g., tx1 transfers assets from A to B and tx2 transfers assets from B to C; these two transactions would never be in the same block, because their readsets and writesets from endorsement would not be satisfied in the phase of committing to the ledger. Is my understanding correct?

bcbrock
2016-12-22 00:40
@xixuejia Blocks are created without regard to the contents of the transactions. Under reasonable assumptions those TX could be committed as long as they appear in the correct order in the block.

xixuejia
2016-12-22 01:11
Thanks! @bcbrock So it's the client's and committers' responsibility to ensure the situation I mentioned above succeeds?

umasuthan
2016-12-22 03:52
Hi, if someone could point me to documentation/reading material on how consensus works in Hyperledger, that would be helpful. I want to understand the flow of actions that happen from the point a client initiates a transaction to the ledger being reconciled with the world state, which component does what, and the types of messages exchanged. Thank you!

bcbrock
2016-12-22 03:54
@xixuejia There is an implicit specification (I haven't seen it formally specified) that if a single client sends two transactions for consensus, then they will be committed in the order they were sent. This should certainly be true if the client waits for acknowledgment from the ordering service after each transaction. Neither the ordering service nor the committer will reorder transactions to remove conflicts. (I personally think this might be an interesting concept to explore, though.)
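
The "wait for acknowledgment after each transaction" pattern amounts to a closed-loop client. A minimal Go sketch, where `broadcast` and `waitAck` are hypothetical placeholders for whatever SDK calls provide those semantics:

```go
package main

import "fmt"

// submitFIFO sketches the closed-loop pattern: send each transaction
// and block on its acknowledgment before sending the next, so the
// relative order of dependent transactions is preserved end to end.
func submitFIFO(txs [][]byte,
	broadcast func(tx []byte) error,
	waitAck func(tx []byte) error) error {
	for i, tx := range txs {
		if err := broadcast(tx); err != nil {
			return fmt.Errorf("broadcast of tx %d failed: %v", i, err)
		}
		if err := waitAck(tx); err != nil {
			return fmt.Errorf("tx %d not acknowledged: %v", i, err)
		}
	}
	return nil
}

func main() {
	var sent []string
	broadcast := func(tx []byte) error {
		sent = append(sent, string(tx))
		return nil
	}
	waitAck := func(tx []byte) error { return nil } // pretend instant ack
	err := submitFIFO([][]byte{[]byte("A->B"), []byte("B->C")}, broadcast, waitAck)
	fmt.Println(err, sent)
}
```

The trade-off, raised repeatedly below, is that this closed loop serializes the client and costs throughput; open-loop submission gives no such ordering guarantee.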

bcbrock
2016-12-22 03:57
@umasuthan You can start by looking at the documents here: https://github.com/hyperledger/fabric/blob/master/proposals/r1

umasuthan
2016-12-22 04:00
@bcbrock, sure. Thanks for the pointers. Since multiple actions happen, like ordering of transactions, block creation, and so on, I am a bit confused as to what the sequence of actions is and what constitutes the success or failure of consensus

bcbrock
2016-12-22 04:04
Some presentations are attached here: https://jira.hyperledger.org/browse/FAB-37

xixuejia
2016-12-22 04:05
thanks @bcbrock

xixuejia
2016-12-22 04:09
so a single client sends tx1 and waits for the block to be generated, then it sends the next tx for endorsement. Only this way can we make sure these two txs succeed if they have a dependency

umasuthan
2016-12-22 04:22
Thanks so much @bcbrock

vukolic
2016-12-22 08:52
@bcbrock there is no such requirement, as this is impossible in case we use a TCert-like mechanism to anonymize clients

vukolic
2016-12-22 08:53
If clients are not anonymized, i.e., when they use something like ECerts, it is possible, but it was never stated as a clear requirement to my knowledge

vukolic
2016-12-22 08:54
Of course we are talking about clients submitting requests without waiting for request commit confirmation

vukolic
2016-12-22 08:56
Intuitively, with TCert anonymity and unlinkability one cannot link the two transactions and establish causal precedence between them

dgorman
2016-12-22 09:37
has joined #fabric-consensus-dev

haixu
2016-12-22 10:00
has joined #fabric-consensus-dev

srirama_sharma
2016-12-22 11:58
@hgabor @vukolic Thanks for looking into this. To clarify, this grpc time out issue is seen only on Power (ppc64le) architecture. It seems on z, the unit-tests are going through without issues as per confirmation from @harrijk who is the Z lead guy who is running unit-tests on fabric every 3 hrs :slightly_smiling_face:

hgabor
2016-12-22 11:59
I will have to look into grpc's PowerPC-specific bugs (known ones)

hgabor
2016-12-22 11:59
"unit-tests_1 | 2016/12/21 12:08:11 grpc: Server.Serve failed to complete security handshake from "127.0.0.1:59654": EOF"

hgabor
2016-12-22 12:00
my guess would be that this is the key. I have never ever seen this

srirama_sharma
2016-12-22 12:07
@hgabor any suggestion on what I could try and check ?

hgabor
2016-12-22 12:09
how exactly is it run? with or without vagrant?

srirama_sharma
2016-12-22 12:19
Not using vagrant. On ppc64le, we just have the docker environment.

srirama_sharma
2016-12-22 12:19
I am just doing `make unit-test` to verify the fabric build

hgabor
2016-12-22 13:27
the unit tests run in a docker container, as I remember

hgabor
2016-12-22 13:27
so you use the same grpc as the others

srirama_sharma
2016-12-22 14:16
Yes. Isn't it part of the base image itself?

kostas
2016-12-22 15:17
@hgabor: Commented on the changeset. (Short answer: yes, but there are a few things that need to be taken care of. Thanks!)

vukolic
2016-12-22 15:58
@kostas @garisingh so do we in principle need a JIRA issue for every changeset?

vukolic
2016-12-22 15:58
I am totally not respecting this - but I was not aware - let me know if I should start

bcbrock
2016-12-22 16:00
@vukolic I think it would be reasonable to specify that if a single thread in a single client sends two transactions over a single gRPC connection to the ordering service, then those transactions should appear in the client’s transmission order in the final block, regardless of TCerts, ECerts, etc. It would also be reasonable not to guarantee this for many reasons. But it needs to be clearly specified which is true.

vukolic
2016-12-22 16:01
hm, again not really possible at least in BFT case

vukolic
2016-12-22 16:01
I could probably try to write a proof for it :slightly_smiling_face:

vukolic
2016-12-22 16:02
the issue is that these requests would be linked within a grpc connection but it may take more time to order them

vukolic
2016-12-22 16:02
and they would be changing hands a lot

vukolic
2016-12-22 16:02
now - if we can link the request we can deliver FIFO (causal order) that you need here

vukolic
2016-12-22 16:02
but with unlinkability of tcerts - this would not fly

vukolic
2016-12-22 16:03
it is probably impossible with unlinkability even in crash-tolerant case

vukolic
2016-12-22 16:03
(deterministically and in all executions)

bcbrock
2016-12-22 16:04
This ordering is currently true with the Kafka orderer; I don't understand what the certificates have to do with it. If 2 TX come in over a gRPC connection, the second TX is not even sent to Kafka until the first is known to be persisted.

vukolic
2016-12-22 16:05
so if the service mimics closed loop

vukolic
2016-12-22 16:05
then it is possible

vukolic
2016-12-22 16:05
if you allow open loop, it does not seem so

vukolic
2016-12-22 16:06
in the pure BFT case this closed-loop trick would not seem to help

vukolic
2016-12-22 16:06
unless you do closed loop from the client

vukolic
2016-12-22 16:06
and client does not submit n+1 until it sees n in ledger of some peer

bcbrock
2016-12-22 16:07
Does the client need to see n+1 in a ledger, or is the ACK from the broadcast service sufficient?

vukolic
2016-12-22 16:08
ack meaning what - what are the semantics?

vukolic
2016-12-22 16:08
above I was talking about open loop *Of course we are talking about clients submitting requests without waiting for request commit confirmation*

bcbrock
2016-12-22 16:09
ACK meaning receipt of the BroadcastResponse

vukolic
2016-12-22 16:09
for closed loop we may be able to do something

vukolic
2016-12-22 16:09
hm, you seem to be talking about solo/kafka-specific stuff that I am not so familiar with

vukolic
2016-12-22 16:09
let me speculate on what it can be

vukolic
2016-12-22 16:09
if this is an ack from a single orderer - then clearly no because a byz orderer may "forget" to fwd it say to the leader

bcbrock
2016-12-22 16:10
All AtomicBroadcast services must return the BroadcastResponse, no?

vukolic
2016-12-22 16:10
well the spec says there is broadcast and deliver

vukolic
2016-12-22 16:10
nothing about broadcast response

vukolic
2016-12-22 16:10
what would be the properties of broadcast response?

vukolic
2016-12-22 16:10
(diff from deliver)

bcbrock
2016-12-22 16:13
That is the question :slightly_smiling_face: The BroadcastResponse in the Kafka case means that the TX has been ordered and persisted. A question though about the specification: are the proto definitions considered part of the specification, or simply one implementation of the specification?

vukolic
2016-12-22 16:14
"ordered and persisted" seems to me as good as delivered

vukolic
2016-12-22 16:14
re spec/impl

vukolic
2016-12-22 16:14
I'd say they are currently the impl - until we are sure what is possible and what is not and what can be "standardized" by a proto

bcbrock
2016-12-22 16:15
Almost as good, it may not have been assigned to a block yet.

vukolic
2016-12-22 16:15
now this is a very specific way kafka and solo operate

vukolic
2016-12-22 16:15
because they assign a block after order

vukolic
2016-12-22 16:15
in all bft protocols I know it is the other way around

vukolic
2016-12-22 16:16
generate block and then order

vukolic
2016-12-22 16:16
if you do it other way around (as in solo/kafka) you would

vukolic
2016-12-22 16:16
1) murder throughput

vukolic
2016-12-22 16:16
2) require very deterministic block sizes set upfront (blockcutter needs to be deterministic)

vukolic
2016-12-22 16:16
so this is very impl thingy

bcbrock
2016-12-22 16:19
Interesting. So I think you would answer @xixuejia question by saying that in general you have to wait for TX1 to be delivered before sending TX2 if you want them in order?

vukolic
2016-12-22 16:20
I did not say that :slightly_smiling_face:

vukolic
2016-12-22 16:20
I said that this is the case if you want unlinkability

vukolic
2016-12-22 16:20
i.e., tcerts

vukolic
2016-12-22 16:20
if not - we can do FIFO/causal

vukolic
2016-12-22 16:20
but we need timestamping at the client

vukolic
2016-12-22 16:20
which means - no unlinkability

vukolic
2016-12-22 16:21
in a sense causal and unlinkability do not go well hand in hand

vukolic
2016-12-22 16:21
it is, roughly speaking, one or the other

vukolic
2016-12-22 16:21
or both but in closed loop

bcbrock
2016-12-22 16:28
How does waiting guarantee unlinkability?

bcbrock
2016-12-22 16:28
Why does it matter whether there are 0 or N TX between two TCert-anonymized TX?

vukolic
2016-12-22 16:29
waiting serves to guarantee fifo trivially

vukolic
2016-12-22 16:29
i submit the first, wait to see that it appears, and then I go to the second

vukolic
2016-12-22 16:30
there i have fifo (albeit a non interesting one)

vukolic
2016-12-22 16:30
and then the ordering can have unlinkability

vukolic
2016-12-22 16:30
and there you have both

vukolic
2016-12-22 16:30
but you have closed loop

kostas
2016-12-22 16:33
@vukolic: Correct. Component `fabric-consensus`, label `sbft`, and the sprint during which this is tackled. Mark the issue as "In Review" when the changeset is posted, and mark as "Done" when it's merged. A bit of a process, but you quickly get used to it. (Thanks!)

kostas
2016-12-22 16:33
@kostas pinned a message to this channel.

vukolic
2016-12-22 16:33
ok...

vukolic
2016-12-22 16:34
the only changeset I tagged with JIRA is not getting merged though

vukolic
2016-12-22 16:34
so I wonder about the pragmatism of following this :slightly_smiling_face:

vukolic
2016-12-22 16:34
(kidding)

kostas
2016-12-22 16:35
Ah, I reviewed this last night, forgot to +1.

vukolic
2016-12-22 16:35
yeah yeah :wink:

kostas
2016-12-22 16:36
It's true (though I get that the timing is fishy). I remember my conversations with Christian on this, and I had actually read the chapter in his book, so it was easy to check.

vukolic
2016-12-22 16:38
I always have to recalculate those

vukolic
2016-12-22 16:38
the numbers do not stick in my brain

joshhus
2016-12-22 16:50
FYI ... I'm drafting a Hyperledger v1.0 overview doc, at a medium-high level (est. one half-page per component), so I am interested in this discussion on the diffs between SOLO, Kafka and SBFT, for an audience of general HL readers. If these diffs / explanations get compiled at some point please LMK. Thanks!

joshhus
2016-12-22 16:56
Working on doc-ing this for external ... SOLO, Kafka, vs. SBFT. mid-high level.

kostas
2016-12-22 16:58
@joshhus Will do. Most of the team is on vacation these days, but we'll get something going in sprint 9.

joshhus
2016-12-22 17:13
@kostas Right, just fyi. I'm committed to some kind of first draft / reviewable for Sprint 8. But placeholders for these details are fine for Sprint 8 / short term, thanks.

joshhus
2016-12-22 17:18
@kostas i.e. working on an HL v1.0 overview doc at mid-high level. ...

scottz
2016-12-22 19:38
@vukolic @bcbrock @xixuejia My understanding is that it is definitely the client that must manage dependencies between transactions, ensuring T1 is committed before the 2nd proposal is submitted, or at least before T2 is broadcast to the orderers; if the SDK plays it risky by getting endorsement for T2 and/or broadcasting T2 before T1 is committed, then it risks failure. In v0.6, transactions could be processed out of order, and I don't think the ordering service of v1.0 guarantees that transaction submission order will be the same as commit order, either.

scottz
2016-12-22 19:38
I see how the ordering of transactions via trivial FIFO can be guaranteed by the client by waiting for an event notification that the transaction (successful or failed) has been written in a block to the ledger (or earlier, such as when it is delivered by the ordering system, if an event is raised at that point). But does the discussion about ordering transactions actually address the question, given the v1.0 architecture?

If the client application submits the 2nd transaction proposal before the 1st one is ordered and/or committed, then I would think we could agree that the 2nd may fail during validation (if not before) if the first has not been validated and committed already. But maybe the question is: could the 2nd (dependent) transaction fail during the ENDORSEMENT phase, when the first has not been committed? (How do we know if B has the money from A yet, to transfer it to C? Is that dependency test done during endorsement or validation? I guess answering this would answer my question.) If so, then there may be other problems to deal with: one thought is that the SDK might have to be able to ensure that the transactions are broadcast to the ordering service in the same order that the transaction proposals were submitted to the endorsers.
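
The dependency question above maps onto a version-based check at commit time. A toy Go sketch of that idea (the types and the version rule here are illustrative, not the real committer code):

```go
package main

import "fmt"

// Each key in the state carries a version; a transaction's readset
// records the versions it observed during endorsement, and the tx is
// invalidated at commit time if any of those versions has changed.
type state map[string]int // key -> current version

type txRWSet struct {
	reads  map[string]int // key -> version observed at endorsement
	writes []string       // keys the transaction writes
}

// validateAndCommit applies the writes (bumping versions) only if
// every read version is still current.
func (s state) validateAndCommit(t txRWSet) bool {
	for k, v := range t.reads {
		if s[k] != v {
			return false // stale read: an earlier tx already wrote this key
		}
	}
	for _, k := range t.writes {
		s[k]++
	}
	return true
}

func main() {
	s := state{"balanceB": 1}
	// tx1 is A->B and tx2 is B->C; both were endorsed against balanceB@1.
	tx1 := txRWSet{reads: map[string]int{"balanceB": 1}, writes: []string{"balanceB"}}
	tx2 := txRWSet{reads: map[string]int{"balanceB": 1}, writes: []string{"balanceB"}}
	fmt.Println(s.validateAndCommit(tx1)) // commits and bumps balanceB to version 2
	fmt.Println(s.validateAndCommit(tx2)) // invalid: its read of balanceB@1 is stale
}
```

Under this model, a dependent tx2 endorsed before tx1 committed passes endorsement but fails validation, which is the failure mode being discussed.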

xixuejia
2016-12-23 00:28
@scottz Thanks so much for the detailed explanation. I think that's basically in accordance with my understanding. So, in short, the client has to wait for commitment of T1 and then send proposal for T2(assume T2 can succeed in both endorsement and validation if there's no T1) if T1 will change the state of readset and/or writeset of T2.

xixuejia
2016-12-23 00:33
so in this case, endorsement test of T2 will pass with our assumption above, but the validation of T2 will fail

xixuejia
2016-12-23 00:38
in this case, the client will wait an indeterminately long time for the block containing T1 to be generated. (The commit of T1 is asynchronous; how does the client know when the commit of T1 is done? By polling?) It seems there's no good support for a client submitting a large number of transactions if some of them have dependencies (weak dependencies: each transaction can survive endorsement and validation if it's the only tx submitted, but may fail validation or endorsement if they are submitted by the client in a batch)

umasuthan
2016-12-23 04:55
Hi @scottz, can you provide some pointers on how the ordering and commit happen in v0.6? thanks

scottz
2016-12-23 15:01
@xixuejia yes, overall throughput will be lower when a client must submit many transactions with many dependencies. Maybe the client applications can implement a resend mechanism: just send many transactions with the assumption that the order will remain the same and most will complete fine (which is probably true), and be prepared to re-propose any transactions that fail validation. In v1.0, the client can register for event notifications; this is not fully developed yet. For example, a client could request all notifications for a particular transaction, such as when the client has other transactions to send which depend on the results being settled. Or maybe it could request event notifications per transaction classification, such as all successes or all failures, etc. That is my understanding of the intentions, but I must mention that I do not really know the full API or how it will work. Maybe you could get more answers from the authors of the Common SDK API https://docs.google.com/document/d/1R5RtIBMW9fZpli37E5Li5_Q9ve3BnQ4q3gWmGZj6Sv4/edit#heading=h.z6ne0og04bp5

scottz
2016-12-23 15:18
@umasuthan v0.6 does not have separate endorser peers. the architecture is much simpler and all functions are combined into a single set of peer nodes. If using a secure and fault-tolerant consensus algorithm such as PBFT, as is provided with the BlueMix networks for example, then clients submit transactions to any peer of a network of 4 peers, which talk to each other to create batches. Imagine many users sending Invoke Transactions to the peers in parallel; the peer consensus network needs to order them, peers all compute and agree on answers, and then commit to ledger. And there is only one chain in v0.6, whereas v1.0 has multiple concurrent chains for different business transaction networks established by clients. Note (in both v0.6 and v1.0) the complexity with ordering is from the distributed network and the parallel processing; but the true difficulties and problems arise when there are breakdowns such as when one or more peer nodes are disconnected or restarted - which causes some transactions to get queued or even lost. And then peer nodes must re-sync with the network when they reconnect, which causes more delays and catch-up algorithms that could lead to unexpected ordering of transactions (different than the order proposed/requested by clients).

umasuthan
2016-12-23 16:53
Thank you so much for the detailed explanation @scottz. This is very helpful. As a follow-up question: if all peers are executing the transaction, how is data privacy maintained? I understood one of the aspects is that peers will have visibility into data on a need-to-know basis. How is that achieved in v0.6 (and 1.0)?

kostas
2016-12-23 17:04
This is not achieved in v0.6, because of what you pointed out - all validating peers need to execute the transaction. In v1.0, this concept of channels cannot be carried over to the SBFT work for the same reasons. You're probably looking at a sidechains-like construct if you want data privacy at a BFT-based network.

yacovm
2016-12-23 17:09
The `ab.proto` defines different ways to pull blocks from the ordering service. For example, you can pass max UINT64 + block_until_ready and get a never-ending stream of blocks as they are created. But what is the default behavior of the ordering service client? Has it ever been discussed?
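
A toy Go model of the seek semantics being described; the names `blockUntilReady`/`failIfNotReady` echo `ab.proto`, but the logic below is a simplified illustration, not the real server:

```go
package main

import "fmt"

// A client asks for blocks in [start, stop] and says whether the
// server should block until a block exists or fail immediately.
type seekBehavior int

const (
	blockUntilReady seekBehavior = iota
	failIfNotReady
)

type seekRequest struct {
	start, stop uint64
	behavior    seekBehavior
}

// serveSeek decides what to do for the next requested block number,
// given the current chain height (blocks 0..height-1 exist).
func serveSeek(req seekRequest, next, height uint64) string {
	switch {
	case next > req.stop:
		return "done" // past the requested range
	case next < height:
		return "send" // block already cut, stream it now
	case req.behavior == blockUntilReady:
		return "wait" // hold the stream open until the block is cut
	default:
		return "fail" // FAIL_IF_NOT_READY
	}
}

func main() {
	// "Give me everything forever": stop = max uint64, block until ready.
	req := seekRequest{start: 0, stop: ^uint64(0), behavior: blockUntilReady}
	fmt.Println(serveSeek(req, 5, 10))  // block 5 exists: send
	fmt.Println(serveSeek(req, 10, 10)) // block 10 not cut yet: wait
}
```

The "sensible default" kostas refers to below corresponds to the `blockUntilReady` branch with an unbounded stop.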

scottz
2016-12-23 17:10
In v1.0, only the authorized users/peers/organizations of a particular channel can access that channel ledger. Based on that, if we abstract out the orderer service (since it will be a protected portion of the entire network), then from customer perspective, this answers the question at a high level.

umasuthan
2016-12-23 17:11
Ok, so in v0.6 all peers have equal access to the data

kostas
2016-12-23 17:11
@yacovm: "Default"? It's essentially a server responding to your requests, right? There is no default.

yacovm
2016-12-23 17:12
There is a default because you can ask `block until ready`, or specify seek behavior: `FAIL_IF_NOT_READY = 1;`

yacovm
2016-12-23 17:12
And you can simply not pass max UINT64 but only pass the next block you need, etc. etc.

yacovm
2016-12-23 17:12
I asked about the client's behavior

umasuthan
2016-12-23 17:13
If I am using blockchain for, say, a health insurance claim, then all parties need not see all data of the patient. That is the context in which I was asking about data privacy.

kostas
2016-12-23 17:14
I understand that. I would expect that "block until ready" is the way you want to roll (so this is a "sensible default" if you wish), but again, it's up to the client.

yacovm
2016-12-23 17:15
yeah, this is exactly what I'm asking about - "it's up to the client" - the client is implemented in the peer; do you or anyone around here know what the behavior is? has it been discussed?

kostas
2016-12-23 17:16
Explicitly no, not that I'm aware of. Implicitly, it's the sensible default I referred to above.

umasuthan
2016-12-23 17:20
@kostas, do we have the side-chain support or preview in 0.6? I guess not. Correct me if I am wrong

kostas
2016-12-23 17:20
Correct.

scottz
2016-12-23 17:23
@umasuthan the security certificates obtained by the users from the COP (Certificate Authority) will allow users to read or write only what they are supposed to. Each transaction request will be accompanied by the user's certs, and this will be enforced by the peers, which are configured with the endorsement policies and validation policies that dictate who can see and change what on that channel. Think of it as the peers fulfilling a query request during the endorsement or validation steps, but only if the peer knows the user is allowed to see it.

umasuthan
2016-12-23 17:27
@scottz, that is at the transaction level and not at the atomic data element level, right? Also, do we support the endorsement policy configurations in v0.6? Currently we are evaluating v0.6 through a PoC for one of the clients. Apologies for asking too many questions

scottz
2016-12-23 17:51
yes; I believe the certs will be useful at transaction level, which is what you asked about different parties (users) in an insurance claim. no; the policies are new in v1.0. You might want to go read this white paper, if you have not already done so: http://www.the-blockchain.com/docs/Hyperledger%20Whitepaper.pdf and maybe https://hyperledger-fabric.readthedocs.io/en/latest/biz/usecases/

umasuthan
2016-12-23 18:12
Thank you so much @scottz for the pointers.

umasuthan
2016-12-23 18:13
Thanks @kostas for the clarifications.

wangjie
2016-12-26 07:18
has joined #fabric-consensus-dev

umasuthan
2016-12-27 03:51
A few questions on consensus. 1. Is there a way to demonstrate a non-consensus scenario in 0.6 or 1.0 of Hyperledger? 2. Given that Hyperledger supports permissioned chains and all transactions are executed on all peers, under what circumstances would a failure to reach consensus result? One possible reason could be issues with syncing when a new peer comes up.

kostas
2016-12-27 05:13
Assuming that by non-consensus you refer to a scenario where the network fails to reach consensus, just deploy non-deterministic chaincode to a 0.6 network, e.g. a chaincode that stores a randomly-generated value (call `rand`, then persist the state)

umasuthan
2016-12-27 05:20
Yes, I can understand that, but in the case of a deterministic, well-written chaincode (typically deployed for asset management solutions), what could be the scenarios for the network failing to reach consensus?

kostas
2016-12-27 15:48
In 0.6, one scenario would include a network partition that prevents you from having a quorum. For example, in a network of 10 validators, a network partition that takes 4 validators out makes it impossible for the network to move on.

umasuthan
2016-12-27 16:08
ok. Thanks so much for the clarifications @kostas

yacovm
2016-12-28 10:45
anybody home?

garisingh
2016-12-28 10:57
depends who you are actually looking for :wink:

yacovm
2016-12-28 10:58
I just saw that the sbft tests take 90 seconds. That seems like a lot, and I think it would be good if they could run in parallel

yacovm
2016-12-28 11:24
I managed to take them down to 35 seconds by playing with port numbers and adding t.Parallel: https://gerrit.hyperledger.org/r/#/c/3553/

makimaki18
2016-12-28 14:20
has joined #fabric-consensus-dev

kostas
2016-12-28 15:42
This is a welcome change, thx.

hgabor
2017-01-02 14:57
guys, I am planning to start sbft refactoring this week

hgabor
2017-01-02 14:57
e.g. moving from custom structures to the common ones

garisingh
2017-01-02 16:25
good luck @hgabor! :wink:

grapebaba
2017-01-03 02:39
Guys, where can I find the documentation about bootstrap, configuration, and reconfiguration?

grapebaba
2017-01-03 02:40
BTW, I am curious whether the fabric orderer supports runtime configuration like etcd does?

muralisr
2017-01-03 02:43
@grapebaba https://jira.hyperledger.org/browse/FAB-359 is a good start and, going from there, searching “bootstrap” in JIRA should get you all the work that's ongoing or planned

grapebaba
2017-01-03 03:03
@muralisr: thanks for your quick response

muralisr
2017-01-03 03:16
sure thing

wangjie
2017-01-03 03:30
hello everyone, what is the meaning of the epoch? Is it the block or the time? And how are replay attacks controlled?

jyellick
2017-01-03 14:19
@wangjie The epoch is a function of the current block height. The exact details of this are still pending, but loosely, the submitter will set the epoch in the header, and, as the block height advances, the transaction will eventually age out and become invalid. This allows for the set of transactions which need to be tracked to prevent replay to be smaller.

tuand
2017-01-03 15:01
scrum ...

2017-01-03 15:02
@tuand has started a Google+ Hangout for this channel. https://hangouts.google.com/hangouts/_/ulc6ezlyzraddepjq4qbgcwqnie.

wangjie
2017-01-04 02:34
@jyellick Thank you!

tzipih
2017-01-04 04:23
has joined #fabric-consensus-dev

tom.appleyard
2017-01-04 14:06
@jyellick (or anyone else who knows) I have a couple more questions if you wouldn't mind answering them:
> The endorsers execute the chaincode to produce a read/write set.
What do the read/write sets produced by the endorsers consist of? Are they simply "I read in these variables and then wrote out these ones as specified by the chaincode; the specific values I read/wrote are x, y, z"?
If there are multiple client submitters on a network run by the same organisation, would they tell each other which chaincode has been deployed and share endorsement policies of said chaincode? If so, is this achieved through hyperledger itself or by some auxiliary service that shares the information?
> ordered batches/blocks are retrieved by calling `Deliver` (with regards to the ordering service API)
Does this mean that the committers are polling the ordering service?
> The ordering network may be offered as a service and not involve any of the entities transacting on the chain, or it may be run by one or more of the transacting entities.
Does this mean that the ordering service is run as an auxiliary service and is not part of hyperledger? What is CFT/BFT? Given you said there is one ordering service for all chains, I take it the transactions for each chain are ordered separately?
> eventually the committer gets a batch (which is really just a block which potentially contains some invalid transactions)
Is the ordering service a single node or is it many nodes? If there are many, how do they co-ordinate themselves? How many transactions are in each batch? Can batches change size? When batches of ordered transactions are sent off, what is in a 'batch' exactly: is it just a list of transactions? Which organisation on the network runs the ordering network?
> invalid transactions
What does 'invalid' mean in this context?

tuand
2017-01-04 14:10
CFT/BFT: crash/byzantine fault tolerance

tuand
2017-01-04 14:11
the ordering service can be one or more nodes depending on the underlying ordering protocol e.g. solo versus Kafka/sBFT

tuand
2017-01-04 14:13
how the nodes co-ordinate also depends on the protocol, e.g. a kafka cluster using zookeeper, or sBFT using the protocol described in the Castro & Liskov paper


tuand
2017-01-04 14:17
the chaincode deployment and endorsement policies can probably be answered better by the guys over in #fabric-peer-dev

tuand
2017-01-04 14:17
Deliver() is basically listening to a gRPC stream

tuand
2017-01-04 14:22
the ordering service runs as a separate process from the hyperledger fabric peers. Fabric does require an ordering service that supports the Broadcast()/Deliver() API, but otherwise you can plug in your own ordering protocol

jyellick
2017-01-04 15:10
@tom.appleyard It looks like @tuand already gave some answers, but here is some additional (partially redundant) info:
> What do the read/write sets produced by the endorsers consist of? Are they simply “I read in these variables and then wrote out these ones as specified by the chaincode, the specific values I read/wrote are x, y, z”
Close. It's "I read these variables at these _versions_, and I wrote these variables at these versions, and this is what I wrote."
> If there are multiple client submitters on a network run by the same organisation, would they tell each other which chaincode has been deployed and share endorsement policies of said chaincode? If so, is this achieved through hyperledger itself or by some auxiliary service that shares the information?
In general, when someone is informed that a chaincode exists and can be invoked, they should also be informed where it resides and what endorsements are required. This is not an in-band hyperledger fabric procedure.
> Does this mean that the committers are polling the ordering service?
No, `Deliver` is a blocking gRPC call, which receives a stream of batches as they are created.
> Does this mean that the ordering service is run as an auxiliary service and is not part of hyperledger?
The ordering service is required, and multiple implementations are offered through the hyperledger fabric project (solo [for testing], Kafka, and SBFT [a PBFT-based protocol]). But any consensus implementation which appropriately implements the `Broadcast`/`Deliver` methods could be used. Most developers are more interested in the chaincode and application, however, so it also provides a nice plug point where an external entity can host the ordering so that the user can focus on the pieces they care about.
> Given you said there is one ordering service for all chains, I take it the transactions for each chain are ordered separately?
Correct, order is guaranteed only within a chain.
> Is the ordering service a single node or is it many nodes? If there are many how do they co-ordinate themselves?
This is all variable based on the ordering implementation and backing consensus algorithm. For solo, the answer is "one node, so no consensus algorithm needed". For Kafka: "arbitrarily many nodes and the ZK/Kafka protocol". For SBFT: "3f+1 nodes and the sbft variant of the pbft protocol".
> How many transactions are in each batch? Can batches change size? When batches of ordered transactions are sent off what is in a 'batch' exactly – is it just a list of transactions?
The batch size is configurable, and may be reconfigured by the chain admins. A batch is a block structure, so they form a hash chain. The batch may contain 'invalid' transactions insofar as there may be MVCC conflicts etc. This is why we differentiate between the words batch and block, though in implementation they use the same backing data structure.
> Which organisation on the network runs the ordering network?
Depends entirely on network configuration, and is more of a social/business question. For some consensus implementations, multiple stakeholders may participate; for others, a single entity will be in control.
> What does 'invalid' mean in this context?
Transactions which are not appropriately signed by an authorized user will never make it into a batch; this is the 'validation' which is done at the orderer. However, if two transactions are submitted changing the same key at the same version, for instance, one of these transactions will fail, but both would make it into the batch. The one which fails we call 'invalid'. A transaction might also not be appropriately endorsed, which could cause it to be 'invalid'. The signing guarantee from the ordering network, however, assures that no one simply submits garbage and that there is non-repudiation on the submission.
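The versioned read/write set and MVCC check described above can be sketched concretely. These struct shapes are illustrative only (the real fabric rwset protos differ); the point is the rule: a transaction is valid only if every key it read is still at the version recorded at endorsement time.

```go
package main

import "fmt"

// Hypothetical shapes, for illustration only: a read set records the
// version at which each key was read; the write set records new values.
type Version struct{ BlockNum, TxNum uint64 }

type ReadItem struct {
	Key     string
	Version Version
}

type WriteItem struct {
	Key   string
	Value []byte
}

type RWSet struct {
	Reads  []ReadItem
	Writes []WriteItem
}

// validate performs the MVCC-style check: every key read by the
// transaction must still be at the version recorded at endorsement.
func validate(rw RWSet, committed map[string]Version) bool {
	for _, r := range rw.Reads {
		if committed[r.Key] != r.Version {
			return false // stale read: another tx updated the key first
		}
	}
	return true
}

func main() {
	state := map[string]Version{"asset1": {BlockNum: 5, TxNum: 0}}
	tx := RWSet{
		Reads:  []ReadItem{{Key: "asset1", Version: Version{5, 0}}},
		Writes: []WriteItem{{Key: "asset1", Value: []byte("new")}},
	}
	fmt.Println(validate(tx, state)) // read version matches, so: true
}
```

A second transaction reading `asset1` at version {5, 0} after the first one commits would fail this check, which is exactly the "both make it into the batch, one is marked invalid" scenario described above.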

yangtao76
2017-01-04 15:55
has joined #fabric-consensus-dev

tom.appleyard
2017-01-04 16:58
@tuand @jyellick Thanks as always guys! Couple of follow-ups:
> Close. It's "I read these variables at these _versions_, and I wrote these variables at these versions, and this is what I wrote"
What happens if an endorser node gets out of step with the others and doesn't have high enough version numbers to match what is in a read/write set?
> The ordering service is required, and multiple implementations are offered through the hyperledger fabric project
But just to check: it's not hyperledger code that is running per se, it's simply that we use other projects to do the work for us (once of course we've configured them to use these `Broadcast`/`Deliver` endpoints)?
> A transaction might also not be appropriately endorsed, which could cause it to be 'invalid'.
Surely if a transaction hasn't been appropriately endorsed it wouldn't be submitted to the ordering network?
> they would both make it into the batch
Would this batch be rejected on the grounds of an invalid transaction before it's sent out?

jyellick
2017-01-04 17:05
@tom.appleyard
> What happens if an endorser node gets out of step with the others and doesn't have high enough version numbers to match what is in a read/write set?
The client will see that the endorsement isn't for the same action, and ask a different endorser, or wait some time and ask again. @muralisr might have more details.
> But just to check it's not hyperledger code that is running per se, it's simply that we use other projects to do the work for us (once of course we've configured them to use these `Broadcast`/`Deliver` endpoints)?
Yes and no. Solo is entirely hyperledger code, as is sbft. Kafka is some hyperledger code to act as a shim to Kafka, but does rely on the Kafka/Zookeeper code for the backing consensus implementation. So it depends on how you deploy your network as to whether the orderer is 'some hyperledger code' or 'all hyperledger code' (and there is nothing that prevents someone from implementing these methods on their own and having 'no hyperledger code').
> Surely if a transaction hasn't been appropriately endorsed it wouldn't be submitted to the ordering network?
For a properly written non-malicious client, yes. There's no real incentive to submit improperly endorsed transactions though.
> Would this batch be rejected on the grounds of an invalid transaction before it's sent out?
No, the ordering service only really does the check of "was the submitter authorized to submit a transaction". It does not actually process any of the MVCC+postimage data; it does not know how, nor does it know about endorsement policies etc. This is all handled after the batch has been delivered to the peer.

tom.appleyard
2017-01-04 17:16
cool, thanks @jyellick :slightly_smiling_face: one last question:
> postimage
What's a postimage - is this a snapshot of what the worldstate/variables look like after the transaction executes?

jyellick
2017-01-04 18:06
> What's a postimage - is this a snapshot of what the worldstate/variables look like after the transaction executes? Right, whereas MVCC is concerned with the version of the keys, the postimage is the new value of (the written) keys.

jyellick
2017-01-04 18:14
@tom.appleyard ^

eagel
2017-01-05 08:15
has joined #fabric-consensus-dev

jojocheung
2017-01-05 08:45
has joined #fabric-consensus-dev

joshhus
2017-01-05 14:30
Hello, is one of these design docs current for v1.0 consensus support? -- https://wiki.hyperledger.org/community/fabric-design-docs -- Or what should I use as source for drafting external v1.0 consensus doc. Thanks!

tuand
2017-01-05 14:33
@joshhus, start with the architecture, flows, and multi-channel docs on the wiki page; also the readme in dir hyperledger/fabric/orderer

joshhus
2017-01-05 14:39
Okay thanks @tuand. I assume they describe the diffs/how to choose between Kafka and SBFT, for example, and whether only one consensus protocol per network is supported (or could I have a separate SOLO channel/chain, e.g.?). What happens if a change in consensus protocol for an existing network is desired, etc.? Questions like those. Inquiries are also coming in as to when each protocol will be ready/available.

tuand
2017-01-05 14:48
not at that level of detail yet josh ... but start from those docs and we can hash out what's needed here

tuand
2017-01-05 15:00
scrum ...


newdev2524
2017-01-06 02:28
has joined #fabric-consensus-dev

newdev2524
2017-01-06 02:35
Hi, I'm using PBFT on v0.6. Are there any guidelines or practices for setting the VIEWCHANGEPERIOD and K values? If we have fewer transactions, how about setting both to 1?

jyellick
2017-01-06 02:58
@newdev2524 For low transaction networks this may be acceptable. Every K blocks, an additional message is exchanged called a checkpoint. Every VIEWCHANGEPERIOD checkpoints, the PBFT leader changes.
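The cadence jyellick describes can be put as simple arithmetic: a checkpoint every K blocks, a view change every VIEWCHANGEPERIOD checkpoints, so a leader rotation every K*VIEWCHANGEPERIOD blocks. A tiny sketch (`blocksPerViewChange` is a made-up helper, not a fabric function):

```go
package main

import "fmt"

// blocksPerViewChange illustrates the v0.6 PBFT settings discussed
// above: a checkpoint every k blocks, and a view change (leader
// rotation) every viewChangePeriod checkpoints.
func blocksPerViewChange(k, viewChangePeriod uint64) uint64 {
	return k * viewChangePeriod
}

func main() {
	// With both set to 1, the leader rotates on every single block.
	fmt.Println(blocksPerViewChange(1, 1)) // 1
	// With larger values, e.g. K=10 and VIEWCHANGEPERIOD=10,
	// the leader only rotates every 100 blocks.
	fmt.Println(blocksPerViewChange(10, 10)) // 100
}
```
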

newdev2524
2017-01-06 03:28
@jyellick Thank you very much for your answer. BTW, does this mechanism of PBFT carry over to v1.0?

warong
2017-01-06 03:32
has joined #fabric-consensus-dev

jyellick
2017-01-06 03:45
@newdev2524 One of the consensus options in v1.0 is SBFT, a variant of PBFT offered in v0.6

newdev2524
2017-01-06 07:43
@jyellick Thanks : )

ecblseg
2017-01-06 19:39
has joined #fabric-consensus-dev

tuand
2017-01-09 15:00
scrum ...

2017-01-09 15:01
@tuand has started a Google+ Hangout for this channel. https://hangouts.google.com/hangouts/_/opsfqne475hzjl2cahb6wpxmree.

scottz
2017-01-09 16:40
@tuand @jyellick When a user client broadcasts a transaction to the orderer, the orderer service is supposed to deliver it in a block to the peers listening on that channel. The design docs do not talk about any broadcast responses to the user. Please clarify: does the user always get some sort of immediate feedback in a response message - either Success or an appropriate error code (for example, if the transaction is badly formatted or not signed)? Or is it sent back only when an error is found?

jyellick
2017-01-09 16:42
Given a stream of `cb.Envelope`s passed into `Broadcast`, each envelope will get a corresponding `cb.Status_SUCCESS` reply, until there is a failure, which will receive a non `cb.Status_SUCCESS` reply, and the server will terminate the stream.

jyellick
2017-01-09 16:42
Any in flight non-acknowledged messages will be discarded.

jyellick
2017-01-09 16:43
Only messages which are replied to with a `cb.Status_SUCCESS` are guaranteed to be 'in consensus'

jyellick
2017-01-09 16:43
@scottz ^

scottz
2017-01-09 17:03
@jyellick To clarify that point: if a user sends two msgs quickly, where the first one leads to an error response such as 403 or 404, then what is the likelihood that the 2nd would be discarded? I.e. what does "in-flight" mean in this context? Or, maybe the more practical question would be: couldn't I expect the https layer to handle resending it for me? If so, then in practice I could essentially ignore this and interpret your answer to mean that "yes, the sender of every broadcast message should normally get a response".

jyellick
2017-01-09 17:03
@scottz If the user sends two messages quickly, and the first one errors, then the second will, with 100% certainty be discarded.

jyellick
2017-01-09 17:04
If you wish the second message to enter the system, you need to re-establish the `Broadcast` link and submit it again

scottz
2017-01-09 17:04
oh, so a new stream is started by the sender, after receipt of the error-coded response?

jyellick
2017-01-09 17:04
Correct, because the server will have terminated the `Broadcast` stream on error

scottz
2017-01-09 17:11
ok. that answers my original question as well as provides expected behavior for more detailed use-cases. The user would be expected to resend any transactions that are not confirmed. But do you know - would that be a responsibility/function of http, or of the client (behave test client, or the SDK/application code)?

jyellick
2017-01-09 17:17
It would be on the client

jyellick
2017-01-09 17:17
The client should basically keep a buffer of unacknowledged requests, and, if the stream terminates, it should reconnect and resend what's in the buffer
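The buffering scheme jyellick describes could be sketched roughly as below. All type and function names here are illustrative, not the real fabric client API, and acks are handled synchronously per message for simplicity (a real client would pipeline):

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the gRPC Broadcast stream.
type envelope struct{ payload string }

type stream interface {
	Send(envelope) error
	Recv() (success bool, err error) // true models a SUCCESS ack
}

// broadcaster keeps the unacknowledged buffer: every sent envelope
// stays buffered until a SUCCESS ack arrives; on stream failure the
// caller reconnects and calls resend.
type broadcaster struct {
	unacked []envelope
}

func (b *broadcaster) send(s stream, env envelope) error {
	b.unacked = append(b.unacked, env)
	if err := s.Send(env); err != nil {
		return err
	}
	ok, err := s.Recv()
	if err != nil || !ok {
		return errors.New("stream terminated; reconnect and resend")
	}
	b.unacked = b.unacked[1:] // acknowledged: drop from the buffer
	return nil
}

// resend replays everything still unacknowledged on a fresh stream.
func (b *broadcaster) resend(s stream) error {
	pending := b.unacked
	b.unacked = nil
	for _, env := range pending {
		if err := b.send(s, env); err != nil {
			return err
		}
	}
	return nil
}

// Fake streams for demonstration: one always acks, one never does.
type okStream struct{}

func (okStream) Send(envelope) error { return nil }
func (okStream) Recv() (bool, error) { return true, nil }

type deadStream struct{}

func (deadStream) Send(envelope) error { return nil }
func (deadStream) Recv() (bool, error) { return false, nil }

func main() {
	b := &broadcaster{}
	_ = b.send(deadStream{}, envelope{"tx1"}) // no SUCCESS ack: stays buffered
	fmt.Println(len(b.unacked)) // 1
	_ = b.resend(okStream{}) // on a fresh stream the buffer drains
	fmt.Println(len(b.unacked)) // 0
}
```
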

scottz
2017-01-09 17:28
ok. thanks. that is what I was thinking. I remember reading in common.pb.go that we use error codes similar to those used by http; so I need to keep straight what gets handled in which layer.

yacovm
2017-01-09 22:45
Question- does the OS check for anything besides signatures of the client on the transaction broadcasted? Meaning- is it possible to broadcast() lots of fake/un-endorsed transactions to the OS to make it cut blocks with mostly invalid transactions that would be marked as invalid in the peers?

jyellick
2017-01-09 23:16
@yacovm The only check is on the outer signature. There is no check on the endorsements, so yes, definitely a client could submit a bunch of bad trans to force frequent block cutting.

jyellick
2017-01-09 23:17
But, since they are validly signed, it should be trivial to identify the misbehaving party, revoke their authorization on the system, and proceed normally.

jyellick
2017-01-09 23:18
The ordering service does not know what an endorsement is, or how to decode the `Data` of an `Envelope` of type `ENDORSER_TRANSACTION`

yacovm
2017-01-09 23:18
only in retrospect though

yacovm
2017-01-09 23:18
it'll be really hard

yacovm
2017-01-09 23:19
since the ordering service doesn't know something wrong is happening, and the peers I assume, only give indication in the logs

yacovm
2017-01-09 23:20
so it'll be like: peers log lots of errors/warnings --> hopefully there is some log monitoring agent *and* a monitoring/operations team seeing alerts ---> they contact the cop(?) and it publishes a CRL?

jyellick
2017-01-09 23:20
Someone could also submit many transactions with knowingly bad MVCC sets. It would be a slightly trickier thing, but unless the orderer is maintaining all the state and essentially becoming a peer committer, I think this is very hard to dodge.

jyellick
2017-01-09 23:21
I think in general, the network admins will want to monitor for 'invalid transactions', and if there's a high volume of any sort, investigate why

yacovm
2017-01-09 23:21
wait, peer committer? I thought the OS is *orderer-ledger*

yacovm
2017-01-09 23:22
I didn't know the OS will do block validation

jyellick
2017-01-09 23:22
It does not

yacovm
2017-01-09 23:22
I also remember you said it currently has a "toy ledger"

jyellick
2017-01-09 23:22
I was simply saying, unless you want the ordering service to do exactly all of the validation that the peer does (which we do not want to do), then a clever client will be able to submit junk into the system

yacovm
2017-01-09 23:23
yeah

jyellick
2017-01-09 23:23
However, this clever client will have to submit the junk under his own ID

jyellick
2017-01-09 23:23
Which makes the attack significantly less attractive.

yacovm
2017-01-09 23:23
what about replay attacks?

jyellick
2017-01-09 23:23
The envelope header is designed for replay attack detection

jyellick
2017-01-09 23:24
It's unimplemented, but, all the pieces are there

yacovm
2017-01-09 23:24
how? timestamp?

jyellick
2017-01-09 23:24
Timestamp + epoch + nonce

yacovm
2017-01-09 23:24
I see

yacovm
2017-01-09 23:25
ok thanks, was just curious

jyellick
2017-01-09 23:25
Sure thing, let me know if I can answer any other questions

yacovm
2017-01-09 23:43
@jyellick , I somehow mustered the cognitive strength to answer the JIRA issue you commented today (MSP replication). But I have a follow up question I prefer asking here and I hope isn't stupid or already discussed, and if it is- it's probably because of the hour here :wink:

yacovm
2017-01-09 23:44
You know the story with the anchor peers right?

yacovm
2017-01-09 23:56
I know you guys were against saving the peer membership in the ledger, I guess rightly-so because that would impact throughput if the churn rate is high, but- is it possible to update the anchor peer list, within a channel (if needed - meaning- if the anchor peers die) once in a long period of time? (I was thinking perhaps, once per a few minutes, if needed of course). The only "problem" with the anchor peer (at least as I see it) is that it needs to be alive all the time. If somehow the gossip layer within an org could update its anchor peer in all channels of the org, that problem would go away.

jyellick
2017-01-10 00:08
@yacovm So absolutely, it _can_ be updated, but updating once every few minutes seems like far too frequent to me. I thought the proposed solution here had been to use a DNS address which could rotate through IPs round robin or have updates pushed to it trivially

jyellick
2017-01-10 00:08
(Especially as you stay protected from bad connections via TLS)

yacovm
2017-01-10 00:08
a few min is the lower bound

yacovm
2017-01-10 00:08
the upper bound is, never

yacovm
2017-01-10 00:09
the DNS solution has major drawbacks IMO

yacovm
2017-01-10 00:12
0) This is the major one: you need org A to be able to update the information in org B about the anchor peer(s) of org A. You assume org B would give org A access to update its DNS records? Very unlikely...
1) You can't program this into fabric because you would have to integrate with many types of DNS providers / server types; it seems brittle.
2) I was thinking that a customer could specify in the core.yaml whether this peer is a candidate to be an anchor peer or not, and that's all the customer would need to configure; the rest would be done by the fabric, magically.

jyellick
2017-01-10 00:16
0) I'm confused by this one. Why does org A need to be able to update the info for org B? Why would org B allow this?
1) Agreed.
2) I agree this is a nice idea, and I'd say this is something that could be implemented as a normal endorser transaction (and not a configuration transaction), but then you have the problem of bootstrapping. The nice thing about the config transaction is that everything you need to bootstrap at any point is in there. But the more frequently it changes, the worse properties it has.

yacovm
2017-01-10 00:16
where is the bootstrapping problem?

yacovm
2017-01-10 00:17
no need for the bootstrap to have them (the anchor peers)

jyellick
2017-01-10 00:18
If you don't need anchor peers at bootstrap, I'd say kick them out of the configuration block, and make it a normal chaincode.

jyellick
2017-01-10 00:18
Then you can change things as often as you'd like

jyellick
2017-01-10 00:18
I thought it was needed for state transfer to function reasonably for a new organization

yacovm
2017-01-10 00:19
oh no, it is needed for cross-organization + it's now an implementation detail that I use in the code to enumerate the organizations of a channel (by the anchor peers)

yacovm
2017-01-10 00:20
I know, perhaps I should have made it something like:

yacovm
2017-01-10 00:20
channel organizations := thisPeerOrg \cup {organizations of anchor peer list}

yacovm
2017-01-10 00:21
Regarding 0- sorry, I'm not thinking clearly at 2:20 AM. Obviously if org B can query org A's DNS server it's enough and org A can simply update *its own DNS records*

yacovm
2017-01-10 00:22
it is also needed to establish view (membership) of all peers of the channel

yacovm
2017-01-10 00:22
some... clients, require this

yacovm
2017-01-10 00:24
> The nice thing about the config transaction is that everything you need to bootstrap at any point is in there. But the more frequently it changes, the worse properties it has. So, what I'm saying is- if an anchor peer is selected wisely within an organization, and it doesn't die every few minutes - rather it stays stable- perhaps this is the right path to take?

jyellick
2017-01-10 00:25
Re 0: Right, maybe it is unreasonable to think org A will publish its DNS names in a publicly accessible way; this just seemed like a standard piece that most orgs already had. Maybe feedback from someone like @garisingh on how real-world deployments are likely to be done would help

yacovm
2017-01-10 00:26
and advantage 3) not every hyperledger client wants to install a DNS server

yacovm
2017-01-10 00:26
and fabric doesn't come with a DNS server

jyellick
2017-01-10 00:26
Well, you don't _have_ to install a DNS server, for a small deployment, you can reference things by IP, but DNS would give you additional flexibility.

yacovm
2017-01-10 00:26
what? IP is even worse

yacovm
2017-01-10 00:27
this requires you to do L-3 load balancing/clustering

jyellick
2017-01-10 00:27
I mean you can roll it out without DNS, for a 'small deployment where deploying DNS is onerous'

jyellick
2017-01-10 00:27
But publishing a few DNS records seems like it should be trivial for most people....

yacovm
2017-01-10 00:27
yeah but in small deployments you mean the anchor peer is well, an anchor?

jyellick
2017-01-10 00:28
Right, for something like a POC test net where you want things entirely hyperledger contained

jyellick
2017-01-10 00:30
With TLS, publishing DNS records even through a third party seems relatively safe

yacovm
2017-01-10 00:31
but the bottom line is, are you against my idea given the promise that the update rate per channel will have a lower bound of once per X minutes? I think that actually, if an anchor peer dies it's OK as long as a new channel isn't created

jyellick
2017-01-10 00:31
Right, bottom line, I think that's too frequent.

yacovm
2017-01-10 00:31
what time-span isn't too frequent in your opinion?

jyellick
2017-01-10 00:32
I'd say scheduled changes maybe quarterly? With obvious exceptions for something like adding a new org member.

yacovm
2017-01-10 00:33
quarterly is 15 min?

jyellick
2017-01-10 00:33
3 months

yacovm
2017-01-10 00:34
I'm talking about lower bound. if the anchor peer dies, how long to wait until a new one is elected in the org, and published on all channels

jyellick
2017-01-10 00:36
I always go back to something that Simon said a while back. If someone actually wants to implement this system securely, they're going to put their admin keys on a USB key, and only access them on an air gapped machine inside a vault. The notion that reconfiguration is automated is a bit antithetical to this idea.

yacovm
2017-01-10 00:37
btw- "With TLS, publishing DNS records even through a third party seems relatively safe" I assume you mean that org A and org B have DNS replication among them?

yacovm
2017-01-10 00:37
because, DNS is plaintext UDP from what I know. so you assume that each org queries its own DNS for the records of the other org's DNS


jyellick
2017-01-10 00:38
But more that, if someone hijacks your records, they cannot impersonate you, because they do not have a correct certificate chain to make the TLS connection.

yacovm
2017-01-10 00:39
I don't think customers would like this, but maybe I'm wrong.

jyellick
2017-01-10 00:40
I'm certainly open to other ideas, and would really like to hear what a real deployer is likely to like or not like

yacovm
2017-01-10 00:40
Anyway, I need to get up tomorrow (today) morning too. if anyone is reading this you should start from: https://hyperledgerproject.slack.com/archives/fabric-consensus-dev/p1484006164001258

jyellick
2017-01-10 00:41
The configuration block is intended to be 'relatively static configuration for the chain'. And so overloading it with information which is anticipated to change (beyond long term administrative tasks like key rotation) feels wrong to me.

jyellick
2017-01-10 00:42
For instance, if an entity can make configuration changes, they can do a fairly effective denial of service to prevent others from reconfiguring the chain.

yacovm
2017-01-10 00:42
you can just send transactions

yacovm
2017-01-10 00:42
what's the difference?

yacovm
2017-01-10 00:42
that the conf. block has only 1 transaction?

jyellick
2017-01-10 00:42
To construct a configuration transaction you must know the sequence number and contents of the previous configuration block.

jyellick
2017-01-10 00:43
So, if someone is rapidly sending reconfiguration transactions, incrementing the sequence number and tweaking the contents, then the other parties cannot guess the next seqno and contents

jyellick
2017-01-10 00:43
It is essentially an exclusive lock.
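The "exclusive lock" behaviour can be sketched: a config update names the sequence number of the configuration it was built against, and the orderer rejects updates built against anything stale. A rapid updater keeps bumping the sequence, so everyone else's updates arrive stale. Types and names here are invented for illustration, not the real configtx machinery:

```go
package main

import "fmt"

// Illustrative: the chain's current configuration sequence number.
type configState struct{ seq uint64 }

// A config update must declare which sequence it was built against.
type configUpdate struct{ basedOnSeq uint64 }

// apply accepts an update only if it was built against the current
// config; a successful update advances the sequence, invalidating any
// other update prepared concurrently against the old sequence.
func (c *configState) apply(u configUpdate) bool {
	if u.basedOnSeq != c.seq {
		return false // built against an old config: rejected
	}
	c.seq++
	return true
}

func main() {
	c := &configState{seq: 3}
	fmt.Println(c.apply(configUpdate{basedOnSeq: 3})) // true; seq is now 4
	// A second party that still believes seq is 3 now loses the race:
	fmt.Println(c.apply(configUpdate{basedOnSeq: 3})) // false
}
```

This is why a party that hammers the service with reconfigurations can effectively deny others the ability to reconfigure: they can never guess the next sequence number in time.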

jyellick
2017-01-10 00:44
In low trust networks, I would expect for no configuration modification to be allowed without at least two parties participating, more likely, the byzantine threshold

jyellick
2017-01-10 00:44
So, for me to modify my MSP, I should sign, and get f others to sign as well.

yacovm
2017-01-10 00:44
but the only ones that can do that are the peers of the channel!

jyellick
2017-01-10 00:44
'do that'? What is 'that'?

yacovm
2017-01-10 00:44
send a conf. block

jyellick
2017-01-10 00:44
Peers generally don't send configuration transactions

jyellick
2017-01-10 00:45
The application can send one to construct a channel (which may have lower privilege requirements to allow automated creation)

yacovm
2017-01-10 00:45
yeah, but I'm saying they all know the seq numbers.

jyellick
2017-01-10 00:45
They do

jyellick
2017-01-10 00:45
But, if someone were rapidly hammering on the service, incrementing the sequence number, the peer sequence numbers would lag

yacovm
2017-01-10 00:46
I understand. this is a viable attack.

yacovm
2017-01-10 00:46
just like... you know what

yacovm
2017-01-10 00:46
sending fake transactions that would make frequent block cutting

yacovm
2017-01-10 00:46
anyway, I'm off. ttyl

jyellick
2017-01-10 00:47
Alright, we can discuss more tomorrow

scottz
2017-01-10 03:04
I thought the anchor peers were a static list that an admin can hand out to new member organizations that would like to join the network. If that list is changing regularly, that becomes an administrator's pain-in-the-arse. Yes, let's think carefully about the impact...

scottz
2017-01-10 03:16
I have a different question, this time about what user-registration really means in v1.0. @jyellick you said earlier that "transactions which are not appropriately signed by an authorized user will never be put into a batch". Now, when a user registers with a peer, they are allowed to submit transaction proposals, right? Does that mean (as in v0.5) they can submit to just that one peer? Does it include ALL channels that the peer knows about (now, or in the past, or future)? OK, what I really want to know is: Do the orderers ensure that a broadcast transaction is signed by an authenticated-user on a PER-CHANNEL basis? From the orderer's perspective, what exactly is an authenticated user?

scottz
2017-01-10 03:24
Other thoughts: We must register a user with a peer component, but it does not make sense to register a user with a specific orderer component - because a peer could submit transactions to any orderer, right? Yet, we say the orderers will accept broadcasts only from authenticated users. Does the user submit broadcasts using a peer's cert (from one of the peers that endorsed its proposal, which is a member of the channel for which the transaction applies)?

jyellick
2017-01-10 03:50
@scottz The orderer cares only whether the signature on the transaction (`Envelope`) satisfies the orderer ingress policy. Most likely, this policy is that the signature is an authorized user of any of the chain MSPs. This policy is specified per chain. There is no notion of 'user-registration' at the orderer, only identity and signature. Also keep in mind that the peers no longer submit transactions in the new architecture, so there's no notion of a "peer's cert" submitting a transaction, this will be a user/application cert.

grapebaba
2017-01-10 04:37
guys, I remember some of you sent the sbft paper link before; could anyone send it again? Much appreciated

scottz
2017-01-10 04:39
@jyellick "the signature is an authorized user of any of the chain MSPs". I interpret that as any user that registered with any peer in any member organization that is participating in the transaction's associated channel. But your statement seems contradictory: if there is no notion of a peer cert, then how does the orderer determine from the user's signature whether it is part of a member org which is participating on that channel?

scottz
2017-01-10 04:40
maybe it might help if I could find the definition of what exactly is in the signature?

jyellick
2017-01-10 05:22
@scottz There is the notion of Policy. In the case of `Broadcast`, this indicates what 'identities' (where identity may be a specific certificate, or certificate attribute, or any other principal supported by the MSP) are authorized to invoke `Broadcast` on a particular chain. The orderer has no notion of what a peer is, only what the ingress policy for a specified chain is. Typically, you can expect that the orderer ingress policy allows all MSP user certs for a chain to be allowed to invoke `Broadcast` for that chain.

nhrishi
2017-01-10 07:20
has joined #fabric-consensus-dev

xixuejia
2017-01-10 09:20
@jyellick Hi Jason, I'm curious about the replay attack prevention. Why not just use the txId? Since the txId is unique in the ledger, a replayed txId will not be accepted. Forgive me if I'm asking a silly question :smile:

subax
2017-01-10 11:14
has joined #fabric-consensus-dev

hgabor
2017-01-10 11:35
@xixuejia In the case of chaincodes, one can run the same chaincode with the same arguments. As far as I know, this would result in the same transaction, so the txID would be the same. (I hope this has some relationship to your question)

xixuejia
2017-01-10 11:58
@hgabor Thank you for your response. To be more clear: a client sends a tx (txId: 123) to the OS and it is committed as a valid tx in the ledger. If a malicious node replays this tx (txId: 123), the committer node won't accept it because there's already a tx w/ the same txId in the ledger. My question was whether the replay attack could be prevented by verifying the unique txId?

xixuejia
2017-01-10 11:58
or even depending on the MVCC check? because replayed tx should not pass MVCC validation

hgabor
2017-01-10 12:04
> My question was whether the replay attack could be prevented by verifying the unique txId?
@xixuejia I guess it could be (depending on the chaincode / actual application)
> or even depending on the MVCC check? because replayed tx should not pass MVCC validation
MVCC is the validation chaincode, right? And one can write one's own validation chaincode, so yes, one can write one that does the appropriate check

xixuejia
2017-01-10 12:48
thanks Gabor

gengjh
2017-01-10 13:46
@grapebaba are you looking for this? http://sammantics.com/blog/2016/7/27/chain-1

jyellick
2017-01-10 13:50
@xixuejia The problem of replay protection is not so much one of identifying replayed transactions as of identifying them efficiently. The `epoch` field in particular is used to scope the set of transactions which must be checked against the new transaction. Doing a full select across all transactions ever committed is not a scheme that would scale over time.
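The epoch-scoping idea above can be sketched as a small filter. This is an editorial illustration of why scoping matters, not Fabric's implementation; `ReplayFilter` and its methods are hypothetical names.

```go
package main

import "fmt"

// ReplayFilter remembers txIDs only for the current epoch, so the set that
// must be checked against each new transaction stays bounded instead of
// growing forever.
type ReplayFilter struct {
	epoch uint64
	seen  map[string]bool
}

func NewReplayFilter() *ReplayFilter { return &ReplayFilter{seen: map[string]bool{}} }

// Accept returns false for replays within the current epoch and for stale
// epochs; advancing the epoch lets old txIDs be forgotten.
func (f *ReplayFilter) Accept(epoch uint64, txID string) bool {
	if epoch > f.epoch { // new epoch: discard the old set
		f.epoch = epoch
		f.seen = map[string]bool{}
	}
	if epoch < f.epoch || f.seen[txID] {
		return false // stale epoch or replayed txID
	}
	f.seen[txID] = true
	return true
}

func main() {
	f := NewReplayFilter()
	fmt.Println(f.Accept(1, "123"), f.Accept(1, "123"), f.Accept(2, "123"))
	// prints: true false true
}
```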

xixuejia
2017-01-10 13:58
@jyellick I see. Thanks for the explanation

kostas
2017-01-10 15:05
@grapebaba @gengjh The paper behind SBFT is really the classic PBFT paper by Liskov. There are differences (the main one being that you process a single batch of requests at a time), but to my knowledge, the closest reference material remains the PBFT paper.

kostas
2017-01-10 15:06
I would suggest we rename this work to something other than SBFT eventually, as people always assume that we refer to the Chain implementation, but that's a low-priority item.

jyellick
2017-01-10 15:06
The other key difference I would point out is that the PBFT paper assumes unordered (UDP type) links, while SBFT assumes FIFO (TCP type).

grapebaba
2017-01-10 15:10
Thanks guys, I found that ticket in Jira

grapebaba
2017-01-10 15:11
well described

adc
2017-01-10 15:27
@binhn @vukolic I see that pbft uses util.ComputeCryptoHash to compute hashes. I was wondering if those hashes should be computed by the BCCSP. Also, I was thinking that the hash function should be configurable in an independent way, I mean by having a specific property under pbft

vukolic
2017-01-10 15:29
@adc lets sync on these. No integration with common libs has been done in sbft so far

vukolic
2017-01-10 15:29
I expect this not to be rocket science

adc
2017-01-10 15:30
me too, but we have to decide. Crypto calls need to be reordered

jyellick
2017-01-10 15:36
@adc I would like to specify the hashing parameters in the genesis block

adc
2017-01-10 15:36
+1

adc
2017-01-10 15:36
I think the genesis block is a perfect place for this kind of configuration

jyellick
2017-01-10 15:37
@vukolic I'm guessing you are aware, but @hgabor has started some work on moving to the common libs

vukolic
2017-01-10 15:37
I am

adc
2017-01-10 15:38
I see another issue in sbft. When a signature is generated sha256 is always used to compute the digest

adc
2017-01-10 15:38
but if the underlying curve used for ECDSA is P384, then sha384 needs to be used

adc
2017-01-10 15:39
we need to be more uniform there

vukolic
2017-01-10 15:39
It is not entirely clear to me that this is the case

adc
2017-01-10 15:39
look at the code, it tells the truth

adc
2017-01-10 15:39
:slightly_smiling_face:

vukolic
2017-01-10 15:40
We cannot insist that every consensus protocol supports all the crypto

adc
2017-01-10 15:40
fair enough

adc
2017-01-10 15:40
then if someone tries to sign with P384 using a digest shorter than 384 bits, it rejects

vukolic
2017-01-10 15:40
But let's see whether such support does not bloat the code; then I may be in favor

adc
2017-01-10 15:41
at the least, I would like to have consistency in using the algorithms

adc
2017-01-10 15:41
and if you support only P256 and SHA2-256

adc
2017-01-10 15:41
I'm perfectly fine

vukolic
2017-01-10 15:41
If you can open a JIRA to track this, that'd be great

jyellick
2017-01-10 15:45
@adc I think you would be a better person to define the hashing parameters for the genesis block. To my mind, I thought we would need to specify hashing algorithm for the block header, whether to compute the data hash via Merkle (with specified width) or flat, and possibly the hashing algorithm to use within the Merkle tree. But, I'm certain there are other places hashes are used (like sbft/pbft). I'm not certain what a reasonable expectation would be. I had thought maybe one global 'hashing algorithm' which would be used anywhere it was not otherwise specified (like for MSP signature validation), but it sounds like maybe that's inadequate?

adc
2017-01-10 15:47
I'm actually a bit worried about a global setting that can work for everything. For sure, when a digest has to be computed to be signed, then it is a different story. The digest must be computed according to the signing algorithm's requirements

vukolic
2017-01-10 15:48
Genesis block needs to have custom field that can be populated by consensus protocol

adc
2017-01-10 15:48
for the rest, it might be fine to have a global one, at least at the very beginning, and with versioning so that we can always change it and remain backward compatible

vukolic
2017-01-10 15:48
If you already want to standardize the genesis block

vukolic
2017-01-10 15:48
A specific config should not be mandated

jyellick
2017-01-10 15:49
@vukolic Genesis block already has this function, you can see it being used by Kafka to populate the brokers for instance

vukolic
2017-01-10 15:50
Very good, so no need to mandate things more uniformly imo

jyellick
2017-01-10 15:51
Right, SBFT or whatever consensus algorithm can do whatever config it would like. But, it would make sense to have a 'hashing default', to essentially determine the behavior of hashing whenever the implementer does not want to have to pick

adc
2017-01-10 15:53
should that component then be in charge of defining the default? Not sure.

adc
2017-01-10 15:54
let's take this example

adc
2017-01-10 15:55
the struct BlockHeader has a Hash method

adc
2017-01-10 15:55
it is implemented by using util.ComputeCryptoHash that uses sha256 (hard-coded)

adc
2017-01-10 15:55
where should the configuration of this hash function come from?
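One answer being discussed is to carry the algorithm name in the genesis block config and look it up at hashing time. A hypothetical sketch (the real `BlockHeader.Hash` takes no arguments and hard-codes SHA-256 via `util.ComputeCryptoHash`; the `algo` parameter, `hashers` table, and serialization here are illustrative only):

```go
package main

import (
	"crypto/sha256"
	"crypto/sha512"
	"fmt"
	"hash"
)

// hashers maps a configured algorithm name (e.g. from the genesis block)
// to a hash constructor.
var hashers = map[string]func() hash.Hash{
	"SHA256": sha256.New,
	"SHA384": sha512.New384,
}

type BlockHeader struct {
	Number       uint64
	PreviousHash []byte
	DataHash     []byte
}

// Hash computes the header hash with a configurable algorithm rather than a
// hard-coded one. The serialization is deliberately simplified.
func (h *BlockHeader) Hash(algo string) ([]byte, error) {
	newHash, ok := hashers[algo]
	if !ok {
		return nil, fmt.Errorf("unknown hash algorithm %q", algo)
	}
	d := newHash()
	d.Write([]byte(fmt.Sprintf("%d", h.Number))) // simplified serialization
	d.Write(h.PreviousHash)
	d.Write(h.DataHash)
	return d.Sum(nil), nil
}

func main() {
	hdr := &BlockHeader{Number: 1, DataHash: []byte("d")}
	digest, _ := hdr.Hash("SHA256")
	fmt.Println(len(digest)) // prints: 32
}
```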

vukolic
2017-01-10 16:53
i do not object to defining a hash function for hash-chaining the blocks

vukolic
2017-01-10 16:54
but internally i do not think that we can mandate much from a consensus implementation

vukolic
2017-01-10 16:54
if an implementation hardcodes sha256 internally it may do so - and people may decide not to use such an implementation if they do not like it

vukolic
2017-01-10 16:54
my point is - we want consensus modular so we need to give it some leeway

vukolic
2017-01-10 16:55
standardizing the block format so clients can consume it makes sense

vukolic
2017-01-10 16:55
but mandating a given signature function is imo too much

vukolic
2017-01-10 16:57
sbft is less of a problem but think of integration of bft smart or any other third party protocol

kostas
2017-01-10 16:57
I'm only casually glancing at this conversation and I don't think you guys disagree here. A sensible default, along with the option to easily change as the user sees fit is what pretty much everyone is arguing for, no?

vukolic
2017-01-10 16:58
you want to give such an implementation room in the genesis block to store its custom config and that's about it

vukolic
2017-01-10 16:58
@kostas perhaps :slightly_smiling_face:

vukolic
2017-01-10 16:59
do we have a "standard " hash function to hashchain the blocks in the ledger?

vukolic
2017-01-10 17:00
or is this itself configurable

kostas
2017-01-10 17:03
Unless something was changed while I was away, we're invoking `ComputeCryptoHash` from the `core/util` package for this. And this is currently hardcoded to SHA-3. These are some of my thoughts on the matter: https://jira.hyperledger.org/browse/FAB-887?focusedCommentId=19743&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-19743

vukolic
2017-01-10 17:06
I am all in for the most efficient function which provides "sufficient" security - which in this case would favor SHA-256 as the default instead of SHA3

vukolic
2017-01-10 17:08
that said i do not object it being configurable

vukolic
2017-01-10 17:09
but this is the only place (talking about the function used to hashchain the ledger blocks) where I see the need for some "standardization" across consensus implementations

kostas
2017-01-11 18:40
For all sbft-related work, let's please create issues in JIRA, link from the changesets, and update the JIRA status accordingly.

kostas
2017-01-11 18:40

yacovm
2017-01-11 19:05
https://gerrit.hyperledger.org/r/#/c/3873/ @jyellick You're saying there was the same file (well, almost the same) the whole time at 2 different places in the file tree?

jyellick
2017-01-11 19:06
After you run `make protos` you'll notice an untracked `attributes.pb.go` in `core`

yacovm
2017-01-11 19:06
oh so it was never checked in

jyellick
2017-01-11 19:06
Right

yacovm
2017-01-11 19:07
but... the old one was in the wrong place right?

yacovm
2017-01-11 19:07
it was used the whole time

yacovm
2017-01-11 19:07
so why don't you delete it as part of the change set?

yacovm
2017-01-11 19:08
I assume someone was linking to the attributes file

jyellick
2017-01-11 19:09
So no. You can see:
```
commit 9ed9ce44b45d9c37d4fb1112061927cb5ccba5d7
Author: Angelo De Caro <adc@zurich.ibm.com>
Date:   Wed Dec 14 09:28:51 2016 +0100

    core/crypto/primitives cleanup: second step

    This change-set continues the cleanup of the core/crypto/primitives
    package. Refactoring has been applied to move methods and files under
    the packages which need them.

    Change-Id: Icfe6adf938b9d96df9dfde3dfebf95f3004fcae7
    Signed-off-by: Angelo De Caro <adc@zurich.ibm.com>
```

yacovm
2017-01-11 19:10
that's core/crypto/primitives though.

yacovm
2017-01-11 19:10
</nit-picking>

jyellick
2017-01-11 19:10
This changeset moved `attributes.proto` and `attributes.pb.go` from `fabric/core/crypto/attributes/proto` to `fabric/accesscontrol/attributes/proto`

jyellick
2017-01-11 19:11
So, the `attributes.pb.go` was still correct from a compilation perspective, no need to really regen it just because the package moved (since its base package name stayed the same)

jyellick
2017-01-11 19:12
But, when `make protos` is run, it's writing any updates to `fabric/core/crypto/attributes/proto/attributes.pb.go` which is not a tracked file anymore

jyellick
2017-01-11 19:12
When the intent, clearly, is to have `make protos` write to `fabric/accesscontrol/attributes/proto/attributes.pb.go` (the tracked file)

yacovm
2017-01-11 19:13
"This changeset" --> adc's?

jyellick
2017-01-11 19:13
Yes

jyellick
2017-01-11 19:13
So, the new changeset simply fixes the `go_package` so that the updates get written to the tracked file in the new location, rather than the untracked in the old. Nothing to delete.

yacovm
2017-01-11 19:14
I see. so when people did `make protos` a new file was added and no one paid attention

jyellick
2017-01-11 19:14
Right, or, because `make protos` was broken, no one was running it, so they didn't get a new file

yacovm
2017-01-11 19:15
Got it, thanks!

jyellick
2017-01-11 19:15
Either way, just something small that slipped through the cracks, I noticed this new file in my untracked changes and wondered where it came from, so put that CR together

jyellick
2017-01-11 19:15
No problem

adc
2017-01-12 07:16
@jyellick Thanks a lot for this. I haven't noticed that

jyellick
2017-01-12 07:27
@adc You're quite welcome, you're not the first to miss this `go_package` directive. Hopefully we will get some CI to test for this sort of mistake in the future.

wanghaibo
2017-01-12 09:17
has joined #fabric-consensus-dev

tuand
2017-01-12 15:00
scrum ...

2017-01-12 15:01
@tuand has started a Google+ Hangout for this channel. https://hangouts.google.com/hangouts/_/cne4mj5rvbgi7nqpgtjvvivrxme.

vukolic
2017-01-13 09:25
@kostas @jyellick can you pls remind me why one needs to use blockcutter in sbft

hgabor
2017-01-13 09:27
@vukolic means that I implemented the Chain interface and its `HandleChain` for SBFT. So SBFT uses the ledger passed to it inside that support structure, and supports multiple chains. Do we need the block cutter? Or can we leave it out and rely on SBFT's internal batching?

vukolic
2017-01-13 10:06
put differently: how dependent are consuming components on the current impl in /orderer/common/blockcutter

vukolic
2017-01-13 10:08
a side comment - "Ordered" is a bit of an obscure name when the function should be used prior to ordering

vukolic
2017-01-13 10:11
especially since what the function is doing is Validation (applying filters), Appending to block and (possibly) Cutting

yacovm
2017-01-13 14:23
Is there an ordering service system channel?

jyellick
2017-01-13 15:09
@vukolic @hgabor
> @kostas @jyellick can you pls remind me why one needs to use blockcutter in sbft
It is twofold. One, the block cutter applies second-pass filters to messages to ensure that only properly signed and otherwise well-formed messages make it into the block. Secondly, the filters produce Committer objects, which are the driving force behind the executable transactions on the orderer. The configuration transactions and a special internal orderer transaction are the executable types. This is what drives chain creation, as well as updates to ACLs etc.
> @vukolic means that I implemented the Chain interface and this HandleChain for SBFT. so SBFT uses the ledger passed to it inside that support structure, and supports multiple chains. do we need block cutter? or can we leave it out and rely on sbft's internal batching?
You can certainly re-implement the block cutter on your own, but I'm still looking for a reason why you might do this. What technical problems does the block cutter cause? My suspicion is that it's much easier to fix blockcutter to also support your use cases than to re-implement it.
> a side comment - "Ordered" is a bit obscure naming when the function should be used prior to ordering
Agreed, this is a rather horrid name. In Solo/Kafka it's accurate, but I'd be happy to see it renamed. I'd also point out it should probably be called `batchcutter` or something similar. The `blockcutter` name came from some discussions that were ongoing about Kafka at the time.
@yacovm
> Is there an ordering service system channel?
Yes, the ordering system chain is the first chain that the orderer starts with. In general, peer orgs will not have access to this channel.
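For readers unfamiliar with the component, the essence of a block/batch cutter can be sketched in a few lines. This is a heavily simplified editorial illustration, not the real `orderer/common/blockcutter` interface (which also returns Committer objects and applies real filters); `cutter` and its fields are hypothetical:

```go
package main

import "fmt"

// cutter accumulates validated messages and cuts a batch when a threshold is
// reached, which a consenter (Solo, Kafka, SBFT) then orders into a block.
type cutter struct {
	pending [][]byte
	maxMsgs int
}

// Ordered mirrors the (admittedly badly named) blockcutter entry point:
// it validates the message, appends it to the pending batch, and returns
// zero or more cut batches plus whether the message was accepted.
func (c *cutter) Ordered(msg []byte) (batches [][][]byte, ok bool) {
	if len(msg) == 0 {
		return nil, false // stand-in for the second-pass filters
	}
	c.pending = append(c.pending, msg)
	if len(c.pending) >= c.maxMsgs {
		batches = append(batches, c.pending)
		c.pending = nil
	}
	return batches, true
}

func main() {
	c := &cutter{maxMsgs: 2}
	b1, _ := c.Ordered([]byte("tx1"))
	b2, _ := c.Ordered([]byte("tx2"))
	fmt.Println(len(b1), len(b2)) // prints: 0 1
}
```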

yacovm
2017-01-13 22:17
Thanks @jyellick . I have another small question- when an application creates a channel, does it pass the channel's ID at creation? I guess it rolls a random ID and hopes it's not taken, right? Because- i don't see any other option (you can't call "Deliver" after the Broadcast because you don't know the channel ID)

jyellick
2017-01-13 22:18
@yacovm Yes, it should be some form of UUID, such as FQDN+timestamp or something

jyellick
2017-01-13 22:18
If it exists, it will get back a FORBIDDEN

yacovm
2017-01-13 22:18
FQDN? what? why?

jyellick
2017-01-13 22:18
It can be a true UUID if it likes

jyellick
2017-01-13 22:18
Whatever scheme will be globally unique

yacovm
2017-01-13 22:19
ok tnx

yacovm
2017-01-13 22:22
1 last question to make sure I understand- so essentially if we have 3 apps in 3 different orgs, they agree out of band who will send the `Broadcast`, and then the app's "user" tells via slack/email/whatever the channel's ID to the other 2 orgs, so they can all call `Deliver`, right?

rahulhegde
2017-01-15 01:17
has joined #fabric-consensus-dev

rahulhegde
2017-01-15 01:46
@muralisr @garisingh We have followed the same steps ` https://jira.hyperledger.org/secure/attachment/10378/peerchaincodedev_in_1.0.txt ` using the images from the connect-a-thon/marble application. There is no multi-chain tried in our steps (i.e. no chain id and version specified) and I doubt these images support it. The logs from the Orderer show that the connection between Peer and Orderer is unreachable. This un-reachability problem occurs very frequently and sometimes prevents deploy/invoke transactions from committing to the ledger.
```
[21:19:45.683] deliver.go:121: [DEBUG] Room for more blocks, activating channel
[21:19:47.002] broadcast.go:125: [DEBUG] Batch timer expired, creating block
[21:19:47.002] ramledger.go:171: [DEBUG] Sending signal that block 5 has a successor
[21:19:47.003] deliver.go:121: [DEBUG] Room for more blocks, activating channel
2017/01/14 21:19:47 grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: write tcp 172.18.0.2:7050->172.18.0.4:38072: write: broken pipe"
2017/01/14 21:19:47 grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: write tcp 172.18.0.2:7050->172.18.0.4:38076: write: broken pipe"
2017/01/14 21:19:48 grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: write tcp 172.18.0.2:7050->172.18.0.4:38084: write: broken pipe"
[21:19:49.300] broadcast.go:125: [DEBUG] Batch timer expired, creating block
[21:19:49.300] ramledger.go:171: [DEBUG] Sending signal that block 6 has a successor
[21:19:49.300] deliver.go:121: [DEBUG] Room for more blocks, activating channel
[22:15:28.435] solo.go:60: [DEBUG] Starting new Deliver loop
[22:15:28.435] deliver.go:38: [DEBUG] Starting new Deliver loop
...
[22:15:28.489] deliver.go:75: [DEBUG] Receiving message Acknowledgement:<Number:4 >
[22:15:28.489] deliver.go:78: [DEBUG] Received acknowledgement from client
[22:15:28.489] deliver.go:121: [DEBUG] Room for more blocks, activating channel
2017/01/14 22:20:44 grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: write tcp 172.18.0.2:7050->172.18.0.4:38096: write: broken pipe"
[22:21:02.431] broadcast.go:125: [DEBUG] Batch timer expired, creating block
[22:21:02.432] ramledger.go:171: [DEBUG] Sending signal that block 7 has a successor
[22:21:02.432] deliver.go:121: [DEBUG] Room for more blocks, activating channel
[22:21:11.717] broadcast.go:125: [DEBUG] Batch timer expired, creating block
[22:21:11.717] ramledger.go:171: [DEBUG] Sending signal that block 8 has a successor
...
```
I have already done screen-sharing with @muralisr for a setup walk-through and we concluded that there is no configuration problem. Could you please let me know if there is any work-around to resolve this problem.

muralisr
2017-01-15 10:28
@rahulhegde had a thought...

muralisr
2017-01-15 10:28
can you share the contents of fabric/peer/chaincode/ folder please ?

rahulhegde
2017-01-15 23:20
@muralisr on v1.0 Architecture Peer, Cop and Orderer Images used from ` https://github.com/IBM-Blockchain/connectathon ` Docker composer - ` https://github.com/rahulhegde/playtime/blob/master/docker-compose.zip ` Sample Chaincode - ` https://github.com/rahulhegde/learn-chaincode/tree/master/start `

rahulhegde
2017-01-15 23:21
Do let me know - I can give a try on my setup.

tuand
2017-01-16 15:00
scrum ...

2017-01-16 15:00
@tuand has started a Google+ Hangout for this channel. https://hangouts.google.com/hangouts/_/edclujx44vamdgt4o7vvbl36aqe.

hgabor
2017-01-16 15:48
85%: https://gerrit.hyperledger.org/r/#/c/3635/ please give me some feedback

jyellick
2017-01-16 16:24
@hgabor Done

mtnieto
2017-01-17 12:40
has joined #fabric-consensus-dev

jeffgarratt
2017-01-17 16:51
@jyellick there?

jyellick
2017-01-17 16:53
I am

jeffgarratt
2017-01-17 16:55
just assigned issue to Luis

jeffgarratt
2017-01-17 16:55
preferredMaxBytes message gives the absMaxBytes failure erroneously

jeffgarratt
2017-01-17 16:55
:wink:

jeffgarratt
2017-01-17 16:55
bot a blocker

jeffgarratt
2017-01-17 16:55
not a blocker

sanchezl
2017-01-17 16:55
whoops

jeffgarratt
2017-01-17 16:55
:wink:

jeffgarratt
2017-01-17 16:56
btw, recommended value?

sanchezl
2017-01-17 16:56
512K will be default for now

jeffgarratt
2017-01-17 16:56
I put 10000000 for absMaxBytes I think

jeffgarratt
2017-01-17 16:56
k

jeffgarratt
2017-01-17 16:56
thnx

jeffgarratt
2017-01-17 16:56
1 Million I mean

jeffgarratt
2017-01-17 16:56
not sure I got zeros right

sanchezl
2017-01-17 16:56
99MB for absolute, 512K for preferred.

jeffgarratt
2017-01-17 16:56
got it

jeffgarratt
2017-01-17 16:56
thnx

sanchezl
2017-01-17 17:26
I’m trying the instructions on this page: https://github.com/hyperledger/fabric/blob/dca94df500461440da165066a7cacc3f1580b811/docs/channel-setup.md And I get: ``` 2017-01-17 17:23:25.250 UTC [msp] Sign -> INFO 027 Signing message Error: Got unexpected status: BAD_REQUEST ```

sanchezl
2017-01-17 17:27


muralisr
2017-01-17 17:28
@jyellick is the above issue from @jeffgarratt related to problem I encountered with orderer.template ?

muralisr
2017-01-17 17:28
@sanchezl I just submitted a fix for that

sanchezl
2017-01-17 17:28
might be, but I’m not seeing the error message

muralisr
2017-01-17 17:29
orderer.template needed to be regened again for some reason


sanchezl
2017-01-17 17:33
I added new config properties. Now I know. :grimacing:

muralisr
2017-01-17 17:36
ah. it was you @sanchezl :slightly_smiling_face:

yacovm
2017-01-17 21:40
if I have an MSP config item, what type of `ConfigurationItem_ConfigurationType` is that?

yacovm
2017-01-17 21:45
(policy, right?)

jyellick
2017-01-17 22:17
Already answered via DM, but it should be of type MSP, which, as you pointed out does not currently exist.


jyellick
2017-01-17 22:18
Which would be nice to get merged at some point soon.

jyellick
2017-01-17 22:18
It is going to break bdd and sdk a bit, but they have all been warned, and I think the sooner, the less painful

scottz
2017-01-18 14:32
@sanchezl Have we put the message size checks in all the right places? Regarding that max preferred message length (512K) - it looks like it is used for a max batch size. But a single transaction can arrive and be accepted as long as it is less than absolute size (99MB). So we could easily receive a single transaction and be unable to batch it up and deliver it... (Please tell me I missed something during my code review.)

sanchezl
2017-01-18 14:34
If the transaction message is larger than the preferred size, it will be in its own batch of 1 message, as long as it's less than the absolute size.
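The sizing rule described above can be sketched as a simple classifier. The 512K/99MB values match the defaults mentioned earlier in this thread; `classify` and its return strings are illustrative, not the actual orderer code:

```go
package main

import "fmt"

const (
	preferredMaxBytes = 512 * 1024      // 512K default preferred batch size
	absoluteMaxBytes  = 99 * 1024 * 1024 // 99MB absolute message cap
)

// classify returns what happens to a message of the given size:
// over the absolute max it is rejected; over the preferred max (but under
// the absolute max) it is cut into its own batch of one; otherwise it is
// batched normally.
func classify(size int) string {
	switch {
	case size > absoluteMaxBytes:
		return "reject"
	case size > preferredMaxBytes:
		return "own-batch"
	default:
		return "normal"
	}
}

func main() {
	fmt.Println(classify(1024), classify(1<<20), classify(100<<20))
	// prints: normal own-batch reject
}
```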

jonathanlevi
2017-01-18 17:27
@mcoblenz (Michael)

mcoblenz
2017-01-18 17:27
has joined #fabric-consensus-dev

jonathanlevi
2017-01-18 17:30
@mcoblenz: we are slowly moving from a channel with 5180 people, to a 911-people one, and onto a 263-people channel… at this rate you'd get an answer within a block or two :wink:

bur
2017-01-18 17:33
has joined #fabric-consensus-dev

yuryandreev
2017-01-19 10:50
has joined #fabric-consensus-dev

yuryandreev
2017-01-19 10:51
Can we run a few orderers in the current version of fabric (from master)? Or do we need to use Kafka now for consensus?

yacovm
2017-01-19 10:53
there is a solo orderer

yuryandreev
2017-01-19 10:56
why? if something goes wrong we will not have an "orderer"

yacovm
2017-01-19 10:59
I never said it should be used in production, it just exists

yacovm
2017-01-19 11:00
there is also sbft

cbf
2017-01-19 13:39
@yuryandreev solo is primarily for testing and development - it isn't meant for production use. Kafka is primarily for environments that don't have a need for BFT, and are satisfied by crash fault tolerance, such as a blockchain solution deployed within a single enterprise or where there is a single trusted central authority. We are also working on sbft orderer that will be finalized after the v1.0 release. There will likely be other alternatives developed over time.

gengjh
2017-01-19 13:57
@cbf will we support migrating the ordering service from Kafka to sbft, since you mentioned sbft will be available AFTER the v1.0 release?

bjorn
2017-01-19 14:48
has joined #fabric-consensus-dev

grapebaba
2017-01-19 15:08
@gengjh: @cbf explained it clearly; they should be used in different scenarios, so I assume there is no 'migrate' relation between them.

jyellick
2017-01-19 15:58
@gengjh I can say that there is no _technical_ impossibility with migrating between consensus types, whether it is supported is another question

kostas
2017-01-19 15:59
Whether it's a practically useful thing to do is also another question.

cbf
2017-01-19 16:24
+1

gengjh
2017-01-20 01:29
ok, clear


hgabor
2017-01-20 16:10
https://gerrit.hyperledger.org/r/#/c/3863/ this one is WIP but needs feedback

hgabor
2017-01-20 16:21
guys, as you may know, there is a problem with SBFT tests on ppc64

hgabor
2017-01-20 16:22
I tried to reproduce the error and succeeded, but I could not find a solution yet

hgabor
2017-01-20 16:22
any help is welcome :smile:

kostas
2017-01-20 16:40
@hgabor: Does it have something to do with the output that I'm also seeing here? https://jenkins.hyperledger.org/job/fabric-verify-x86_64/5455/consoleFull

alanlee
2017-01-21 01:24
has joined #fabric-consensus-dev

alanlee
2017-01-21 01:26
Questions: (1) Do we have any document on how PBFT in Hyperledger works? (2) If the network has >50 nodes with many transactions, is current implementation good for production? Thank you very much.

tuand
2017-01-21 01:47
@alanlee hyperledger fabric v0.6 is a pretty faithful implementation of the algorithm described in the PBFT paper by Castro & Liskov

alanlee
2017-01-21 02:02
Thanks @tuand .

down-the-fall-line
2017-01-21 22:52
has joined #fabric-consensus-dev

miketwenty1
2017-01-21 22:55
has joined #fabric-consensus-dev

miketwenty1
2017-01-21 22:55
hello

miketwenty1
2017-01-21 22:57
I was hoping to know more about the fabric blockchain, anyone here to field questions?

garisingh
2017-01-21 23:03
@miketwenty1 - people will usually get back to you. there are a few folks on US east coast and a few folks in Europe who use this channel

miketwenty1
2017-01-21 23:09
ok, so my first question is.. what advantages do blocks provide vs not using blocks

silliman
2017-01-21 23:21
@miketwenty1 Have you seen this bitcoin paper? https://bitcoin.org/bitcoin.pdf It's a relatively short read at 9 pages. Check out section 4 - see how the hashes are at the Block level, not per transaction? While implementation details differ, this is conceptually what Fabric is doing as well - that's why we need blocks. If you set your configuration parameters for a max of 1 transaction per block, you would "in essence" be chaining at the transaction level, but it would still be wrapped by "blocks". (Of course you might be killing yourself performance-wise depending on your expected transaction volume, but that's a different story)

garisingh
2017-01-21 23:21
are you questioning the hash chain part of blocks, or the batching of multiple transactions into blocks, or both?

simers
2017-01-22 15:24
has joined #fabric-consensus-dev

miketwenty1
2017-01-23 04:27
@silliman yeah good point with the 1 transaction per block analogy. So I get why blocks are used in the white paper you posted: it's to group transactions in with proof of work.. (please correct me if I'm wrong).. if PoW isn't being done, maybe you could just use a normal immutable queue like Kafka? Kafka doesn't need the overhead of blocks but still maintains good consistency and sequencing of events.

miketwenty1
2017-01-23 04:29
@garisingh hmm, I wouldn't exactly say batching into blocks is confusing.. it just doesn't seem to do much for security if you are in a permission-based system. I would like to explore this with someone.

kostas
2017-01-23 07:00
@miketwenty1 -- these are reasonable questions. (I've had similar doubts in the past.)

kostas
2017-01-23 07:00
1. Batching into blocks is an optimization. See Problem 1 + Solution 1 here: https://docs.google.com/document/d/1vNMaM7XhOlu9tB_10dKnlrhy5d7b1u8lSY8a-kVjCO4/edit

kostas
2017-01-23 07:01
2. Hash-chaining is also an optimization because a hash-chained sequence means that you just need to verify the signatures on the tip of the chain, and the rest follows from the hash-chaining, i.e. no more signature verification needed, you just make sure that the block hashes match.

kostas
2017-01-23 07:01
3. Hash-chaining also provides some extra safety against forging/tampering. Extreme as this scenario might be, assume that the tip is now at block 1M and we don't do hash-chaining. There might be an attack where I focus all of my computing resources on coming up with block 2M that will carry my own, not-quite-right transactions. I now have 1M blocks of lead time in which to come up with the right signature over the predictable payload `block number = 2,000,000 | transactions = evil-transaction-goes-here`. This is not an easy task by any means, but if we had a hash-chain, it would definitely be _much_ harder, because you cannot predict the payload of block `n` until you get block `n-1`. There's also an attack where, if you do periodic key rotation and an earlier key gets leaked, the adversary will only be able to rewrite history up to a point (up to where the key was valid), and this tampering will be immediately obvious because the hash-chain is broken.
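The tamper-evidence property in points 2 and 3 is easy to demonstrate in miniature. A minimal editorial sketch (real Fabric block hashing covers a serialized header, not raw bytes, and blocks are additionally signed by the orderers):

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// Block is a toy block: each block commits to its predecessor's hash.
type Block struct {
	PrevHash []byte
	Data     []byte
}

func (b *Block) Hash() []byte {
	h := sha256.Sum256(append(b.PrevHash, b.Data...))
	return h[:]
}

// verifyChain recomputes every back-link: once you trust the tip, no further
// signature checks are needed, and tampering with any earlier block breaks
// the chain immediately at the point of divergence.
func verifyChain(chain []*Block) bool {
	for i := 1; i < len(chain); i++ {
		if !bytes.Equal(chain[i].PrevHash, chain[i-1].Hash()) {
			return false
		}
	}
	return true
}

func main() {
	g := &Block{Data: []byte("genesis")}
	b1 := &Block{PrevHash: g.Hash(), Data: []byte("tx")}
	chain := []*Block{g, b1}
	fmt.Println(verifyChain(chain)) // prints: true
	g.Data = []byte("tampered")     // rewrite history
	fmt.Println(verifyChain(chain)) // prints: false
}
```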

hgabor
2017-01-23 09:40
where can I read about configuration transactions? I mean something high level, e.g. we have these docs: https://wiki.hyperledger.org/community/fabric-design-docs

hgabor
2017-01-23 12:06
please give it some love in the form of a +2


hgabor
2017-01-23 12:07
or even +3s are welcome

hgabor
2017-01-23 12:07
but it would be good if I didn't have to rebase it again as it is a pain.. you know :smile:

jyellick
2017-01-23 14:39
@hgabor https://docs.google.com/document/d/1Qg7ZEccOIsrShSHSNl4kBHOFvLYRhQ3903srJ6c_AZE/edit contains information on the configuration transaction (towards the end)

hgabor
2017-01-23 14:40
is that document linked on this page? https://wiki.hyperledger.org/community/fabric-design-docs

jyellick
2017-01-23 14:41
Yes

jyellick
2017-01-23 14:41
The MSP + ACL document

hgabor
2017-01-23 14:42
nice

hgabor
2017-01-23 14:42
thanks

hgabor
2017-01-23 14:42
I did not know that it is in the MSP+ACL docs

hgabor
2017-01-23 14:44
is this trivial?

jyellick
2017-01-23 15:23
Not entirely sure what you mean

jyellick
2017-01-23 15:23
I'd say it's not trivial, might be worth putting into its own doc

hgabor
2017-01-23 15:33
yeah, that is what I meant :slightly_smiling_face:

jyellick
2017-01-23 15:51
The concept of the configuration transaction is simple enough, can be explained in a few sentences

jyellick
2017-01-23 15:52
The details of what pieces map to what have gotten increasingly complicated though since its inception

hgabor
2017-01-23 16:10
yep, but I think having docs on everything easily available is a must

hgabor
2017-01-23 16:11
there is a doc on that config tx thing and that's great, but it is a little bit (maybe) hard to find

hgabor
2017-01-23 16:21

jyellick
2017-01-23 17:11
+1 ^ I have reviewed this, would appreciate if others would as well

miketwenty1
2017-01-23 18:00
@kostas this is very interesting.. we are using kafka at my job, but we simply make the customers/services auth before they are able to write.. what added security are we achieving with this system? It almost seems like the benefit is being able to see if something has been tampered with, but I'm wondering under what pretext something could be tampered with if it's a permission-based system in the first place.. if the nodes that are the oracles of this system allowed something fraudulent/false to be added, the write shouldn't happen in the first place without the proper digital signature.. right? I feel like the system would already be severely compromised if this kind of activity were able to take place, and hashes and such could be rewritten with signatures to rewrite events.. would like to hear your thoughts.

kostas
2017-01-23 18:03
@miketwenty1: Isn't tampering orthogonal to whether the system is permission-based or not?

miketwenty1
2017-01-23 18:05
isn’t tampering relative… if a customer or service has a proper key or has authed.. it looks legitimate when it writes or puts data into a topic, right?

miketwenty1
2017-01-23 18:05
i see it being extremely hard to go back and tamper with an entry in the generic sense of going back in and changing something in a kafka topic..

miketwenty1
2017-01-23 18:06
it’s an append only queue right?

jyellick
2017-01-23 18:10
@miketwenty1 although Kafka may not support modifying items which have been queued, there is no technical reason a sufficiently motivated attacker with access to the brokers could not do this, and there would be no way to detect or refute this change

kostas
2017-01-23 18:10
Right, I imagine that given enough incentive this isn't impossible.

miketwenty1
2017-01-23 18:12
@jyellick good point, little green on this still.. how does fabric prevent this? I was thinking you could just rewrite the hashes if needed and it would also be undetectable.

kostas
2017-01-23 18:14
You could absolutely rewrite history that way, which is where a periodic key rotation would come handy.

jyellick
2017-01-23 18:14
@miketwenty1 Anyone who already has a copy of the chain, could refute this, and would be able to trivially locate the change, based on where the hash chains diverge. Additionally, because blocks are signed by the ordering service, the attacker would have to have access to the signing keys of the orderers and falsify a signature for every block. But, yes, if the keys are compromised, a new chain could be forged.

jyellick
2017-01-23 18:15
And as @kostas points out, if keys are periodically rotated (and destroyed) this should become impossible for blocks whose key has already been rotated

miketwenty1
2017-01-23 18:18
it seems like this kind of idea of fabric comes in handy when auditors or another party wants to be able to say quickly, with more assurance, that something hasn’t been tampered with. is key rotation mandated by protocol? or does it kind of have a TTL? i imagine a rotating key would just prevent bad new writes, not rewrites

kostas
2017-01-23 18:20
I think what makes these arguments a bit difficult to digest is that we're parsing them in a Kafka (fully-trusted essentially) environment, where an attack is theoretically easier. Once you start parsing them in a BFT context where multiple signatures are needed, things become much more difficult for the attacker.

kostas
2017-01-23 18:21
Indeed a rotating key would still allow you to rewrite up to a point. And then a hashchain would make it obvious that there's a point of divergence, assuming you know that for block 100 and onward you're supposed to be getting signatures from pubKey `foo`.
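A minimal sketch of that divergence check, using a hypothetical block format (nothing like Fabric's actual one): anyone holding an honest copy of the chain can locate the first tampered block by comparing hashes.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// Block is a hypothetical block: payload data plus the previous block's hash.
type Block struct {
	Data     string
	PrevHash [32]byte
}

func blockHash(b Block) [32]byte {
	return sha256.Sum256(append([]byte(b.Data), b.PrevHash[:]...))
}

// firstDivergence returns the index of the first block where two copies of
// the same chain disagree, or -1 if they match everywhere.
func firstDivergence(a, b []Block) int {
	for i := range a {
		if blockHash(a[i]) != blockHash(b[i]) {
			return i
		}
	}
	return -1
}

// tamperDemo builds a 3-block chain, forges block 1 in a copy, and returns
// where the two chains diverge.
func tamperDemo() int {
	chain := make([]Block, 3)
	chain[0] = Block{Data: "genesis"}
	for i := 1; i < len(chain); i++ {
		chain[i] = Block{Data: fmt.Sprintf("tx-%d", i), PrevHash: blockHash(chain[i-1])}
	}
	forged := append([]Block(nil), chain...)
	forged[1].Data = "tampered"
	return firstDivergence(chain, forged)
}

func main() {
	fmt.Println(tamperDemo()) // 1
}
```

Note that forging block 1 also invalidates block 2's PrevHash link, which is why rewriting history requires re-signing every block from the tamper point onward.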

miketwenty1
2017-01-23 18:24
@kostas this has been really helpful information. Any thoughts or experiments on optimal *blocksize/blocktimes* (is this good terminology you are using in fabric?)

kostas
2017-01-23 18:29
There is a lot of work to be done there for sure. @bcbrock has done some great preliminary work on batch size (blocksize) https://jira.hyperledger.org/browse/FAB-1171, but that's all we have for now. We're almost there with adding functionality at this point. Next on my list is to add instrumentation, and then work on perf. evaluations and optimizations.

miketwenty1
2017-01-23 19:09
@kostas one more question.. forgot.. where does persisting state come in through couchdb? isn’t data persisted in kafka itself?

kostas
2017-01-23 19:35
@miketwenty1: So Kafka is the message bus. It persists data, but not forever, unless the administrator chooses so. Once an ordering node gets the (orderer) messages from Kafka for a chain, it persists them locally in a CouchDB or what-have-you ledger instance.

dave.enyeart
2017-01-23 19:37
@miketwenty1 It would probably help if you review the ledger overview charts in https://jira.hyperledger.org/browse/FAB-758

miketwenty1
2017-01-23 19:42
@dave.enyeart i see _Ability to retrieve past values/trans for key (simple provenance), uses new index into blockchain_ Which part exactly is the blockchain? or the whole system is considered a blockchain?

dave.enyeart
2017-01-23 19:44
we often consider the whole system a blockchain, and therefore we are usually specific; for example, the ledger comprises the block hash chain on the file system, some indexes into that (leveldb), as well as a state database in either leveldb or couchdb

hgabor
2017-01-24 11:37
could somebody familiar with our crypto stuff help me?


hgabor
2017-01-24 11:37
I get this on CI

hgabor
2017-01-24 11:37
but it works OK on my machine

hgabor
2017-01-24 11:39
yeah, it is the starting location...

yacovm
2017-01-24 15:32
@adc :arrow_up_small:

adc
2017-01-24 15:49
@hgabor, I'm looking at test and it looks like that the wrong path is chosen

adc
2017-01-24 15:50
/etc/hyperledger/msp/sampleconfig/cacerts should be /etc/hyperledger/fabric/msp/sampleconfig/cacert

adc
2017-01-24 15:50
it is just my educated guess, I haven't written that test code

hgabor
2017-01-24 15:54
@adc sorry I found out: > yeah, it is the starting location... it did not find the config

adc
2017-01-24 15:55
okay :slightly_smiling_face:

yacovm
2017-01-24 16:43
uh, is anyone going to fix this? ^

sanchezl
2017-01-24 16:44


sanchezl
2017-01-24 20:00
@hgabor , I found this in another channel.

hgabor
2017-01-25 12:33
@sanchezl thx

hgabor
2017-01-25 12:34
I am about to move sbft's proto files into protos/common or somewhere to there - any objection? :slightly_smiling_face:

vukolic
2017-01-25 12:48
as discussed on DM - let's move to protos/sbft

kostas
2017-01-25 14:40
`protos/orderer/sbft` likely a better home.

jyellick
2017-01-25 14:48
+1

hgabor
2017-01-25 15:12
okay I will move them to there

vukolic
2017-01-25 19:22
@jyellick @kostas can you pls remind me what are the committers doing in blockcutter

jyellick
2017-01-25 19:22
@vukolic For normal transactions (peer ones) they are no-op

jyellick
2017-01-25 19:23
For configuration transactions, and some internal bookkeeping transactions, they modify the state of the orderer (by creating new channels, modifying who is allowed to read/write to a channel, etc.)

vukolic
2017-01-25 19:23
so what are they - transactions?

vukolic
2017-01-25 19:23
configuration transactions?

vukolic
2017-01-25 19:27
what I was a bit afraid of just materialized: blockcutter support bloating the sbft code by 20-25%

vukolic
2017-01-25 19:28
and the code is not "simple" as it used to be

vukolic
2017-01-25 19:29
anyway

vukolic
2017-01-25 19:29
if committers are configuration transactions - they should be called that

vukolic
2017-01-25 19:30
furthermore we discussed the name Ordered - and I really think that MUST change

jyellick
2017-01-25 19:33
I agree the name ordered should change

jyellick
2017-01-25 19:33
And that the interface as a whole could probably use some renaming

jyellick
2017-01-25 19:34
We should probably not be returning no-op committers

jyellick
2017-01-25 19:34
Instead, we should only return committers for transactions which modify orderer state

jyellick
2017-01-25 19:34
It is definitely going to complicate the sbft code, I see no real way around this.

jyellick
2017-01-25 19:34
SBFT gets away today under the assumption that the orderer maintains no state

jyellick
2017-01-25 19:35
But this is not true. The orderer maintains state, but it does not frequently modify it.

jyellick
2017-01-25 19:35
You could set the state at genesis, and leave it that way forever, and then things would be fine

jyellick
2017-01-25 19:36
But, if we want to do things like access control, with a dynamic set of authorized credentials, then we have to introduce the notion of state

vukolic
2017-01-25 19:36
that is all fine - there is a need for orderers to interpret some transactions

vukolic
2017-01-25 19:36
that said, there are quite a few changes needed here to maintain the attribute "simple"

jyellick
2017-01-25 19:37
My loose thought was that blockcutter needs to return whether or not a batch modifies orderer state.

jyellick
2017-01-25 19:37
Because if it does, this state must be committed before the blockcutter (probably more appropriately called batchcutter) processes any new transactions

jyellick
2017-01-25 19:38
(Because the state change might make some future transactions valid or invalid, that change must commit first)

vukolic
2017-01-25 19:38
not sure now what are you referring to

jyellick
2017-01-25 19:39
Imagine the current orderer state is, "Allow transactions from A and B"

jyellick
2017-01-25 19:40
You are operating with a batch size of 1, and see the following flow of transactions:
1. tx from A
2. reconfiguration tx: "Allow transactions from B and C"
3. tx from A
4. tx from C
The correct resulting block series looks like: [tx.A, tx.Reconf, tx.C]

jyellick
2017-01-25 19:41
A consensus implementation which tries to pipeline too much around the state modifying reconfig transaction could end up with: [tx.A, tx.Reconf, tx.A] If the reconfiguration commits after the third block has been cut
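The hazard can be made concrete with a small sketch (hypothetical types, batch size 1): each reconfiguration is committed before the next transaction is validated, which yields the correct block series.

```go
package main

import "fmt"

// Tx is a hypothetical transaction: either a normal tx from a sender, or a
// reconfiguration carrying a new set of allowed senders.
type Tx struct {
	Sender   string
	NewAllow map[string]bool // non-nil => reconfiguration tx
}

// order validates txs against the current allow-set, committing each
// reconfiguration before looking at the next tx. Invalid txs are dropped.
func order(allow map[string]bool, txs []Tx) []string {
	var blocks []string
	for _, tx := range txs {
		if tx.NewAllow != nil {
			allow = tx.NewAllow // commit the state change first
			blocks = append(blocks, "tx.Reconf")
			continue
		}
		if allow[tx.Sender] {
			blocks = append(blocks, "tx."+tx.Sender)
		}
	}
	return blocks
}

func main() {
	allow := map[string]bool{"A": true, "B": true}
	txs := []Tx{
		{Sender: "A"},
		{NewAllow: map[string]bool{"B": true, "C": true}},
		{Sender: "A"}, // invalid after the reconfiguration
		{Sender: "C"}, // valid after the reconfiguration
	}
	fmt.Println(order(allow, txs)) // [tx.A tx.Reconf tx.C]
}
```

A pipelined implementation that validated the third tx against the old allow-set would instead emit [tx.A, tx.Reconf, tx.A], which is exactly the incorrect series described above.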

vukolic
2017-01-25 19:41
not really

vukolic
2017-01-25 19:41
ok let's start slowly

vukolic
2017-01-25 19:41
1) s bft does not have pipelining

jyellick
2017-01-25 19:42
Are we certain? Does the primary prepare batches ahead of time? Or only once it has an available sequence number?

vukolic
2017-01-25 19:42
2) pipelining can be done for all non-config tx

vukolic
2017-01-25 19:42
we are certain

jyellick
2017-01-25 19:42
Okay

vukolic
2017-01-25 19:42
it is built that way

vukolic
2017-01-25 19:42
intentionally

jyellick
2017-01-25 19:42
I agree, for non-config txs, pipeline can (and should) be done

vukolic
2017-01-25 19:42
with pipelining pending

jyellick
2017-01-25 19:42
Right

vukolic
2017-01-25 19:42
now

vukolic
2017-01-25 19:43
pipelining should be done whenever there are no config txs, for throughput; this is future work anyway

vukolic
2017-01-25 19:43
3) pipelining can also be done with config tx

vukolic
2017-01-25 19:43
this is called speculative execution

vukolic
2017-01-25 19:44
where the orderer would change the state *speculatively* upon processing the pre-prepare of a config transaction with pipelining on

vukolic
2017-01-25 19:44
this is more involved but was proposed in literature
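A minimal sketch of that speculative approach (hypothetical names, not from the sbft code): the state change is applied at pre-prepare time and rolled back if a view change means the config tx never commits.

```go
package main

import "fmt"

// Config is a hypothetical orderer configuration.
type Config map[string]string

// speculativeState holds the committed config plus an optional in-flight
// speculative config from a pre-prepared but uncommitted config tx.
type speculativeState struct {
	committed Config
	spec      *Config // non-nil while a config tx is in flight
}

// preprepare applies a config change speculatively.
func (s *speculativeState) preprepare(c Config) { s.spec = &c }

// commit makes the speculative state durable.
func (s *speculativeState) commit() {
	if s.spec != nil {
		s.committed = *s.spec
		s.spec = nil
	}
}

// abort (e.g. on view change) discards the speculative state.
func (s *speculativeState) abort() { s.spec = nil }

// current is the state used to validate subsequent pipelined txs.
func (s *speculativeState) current() Config {
	if s.spec != nil {
		return *s.spec
	}
	return s.committed
}

func main() {
	s := &speculativeState{committed: Config{"allow": "A,B"}}
	s.preprepare(Config{"allow": "B,C"})
	fmt.Println(s.current()["allow"]) // B,C (speculative)
	s.abort()                         // view change: the config tx never committed
	fmt.Println(s.current()["allow"]) // A,B
}
```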

vukolic
2017-01-25 19:44
I am talking now here more about the complexity of blockcutter

vukolic
2017-01-25 19:45
given that *current* sbft did not have pipelining anyway

vukolic
2017-01-25 19:45
+300 lines of code on 1200 lines code base blows my mind

vukolic
2017-01-25 19:45
but I guess I will have to try to wrap my head around it

vukolic
2017-01-25 19:45
and there the naming does not help

vukolic
2017-01-25 19:45
for one thing

vukolic
2017-01-25 19:46
so back to that

vukolic
2017-01-25 19:47
filter.committers could be filter.configTX?

vukolic
2017-01-25 19:47
and Ordered could be ValidateAndAppend?

vukolic
2017-01-25 20:02
it should actually just be Validate

jyellick
2017-01-25 20:12
I am not in love with `Validate`, but it is orders of magnitude better than `Ordered` (sorry for inflicting that name on the world)

jyellick
2017-01-25 20:13
And I hesitate to simply call it configtx, because that has a very specific implication; there are other sorts of transactions which can modify orderer state or must be otherwise handled specially

vukolic
2017-01-25 20:16
so I see two things re Ordered

vukolic
2017-01-25 20:16
one is the SBFT Ordered which basically is now a System API call

vukolic
2017-01-25 20:16
that should be Validate in absence of better name

vukolic
2017-01-25 20:17
I would leave BC.Ordered naming to you but strongly suggest renaming

vukolic
2017-01-25 20:17
re configtx and committers, let's just understand what committers are

vukolic
2017-01-25 20:17
shall we call them systemtx?

vukolic
2017-01-25 20:18
like in filter.systemtx

jyellick
2017-01-25 20:25
Sounds reasonable

vukolic
2017-01-25 20:26
SystemTransactions or SystemTX?

jyellick
2017-01-25 20:28
`SystemTx` would be my preference, We use `ConfigTx` in some places of the code

jyellick
2017-01-25 20:29
With respect to `blockcutter.Ordered` I'm inclined to completely rename this, I don't think `blockcutter` is a good name, I would now think something like `batchcutter.ProposeInclusion` or something like that, though I don't even really care for that

vukolic
2017-01-25 20:31
Cannot rename due to errors: C:\gocode\src\github.com\hyperledger\fabric\examples\chaincode\go\asset_management_interactive\app1\app1.go:35:17: PeerClient not declared by package pb C:\gocode\src\github.com\hyperledger\fabric\examples\chaincode\go\asset_management_interactive\app1\a

vukolic
2017-01-25 20:31
I was trying to rename Committers but this came up...

vukolic
2017-01-25 20:32
The problem I have with BC is that it tries to do many things at once

vukolic
2017-01-25 20:33
this is: 1) validation of msgs, e.g., signature checks, 2) sizing the batch/block, 3) filtering out system tx, 4) appends/includes tx to the current batch

vukolic
2017-01-25 20:33
these are all valid things to do

vukolic
2017-01-25 20:33
but together make BC a bit convoluted

vukolic
2017-01-25 20:33
esp. with Ordered() and Cut() calls
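For illustration, the four responsibilities listed above could be split into narrower interfaces; this is a hypothetical sketch (names are illustrative, not Fabric's actual API), with a count-based cutter covering only sizing (2) and appending (4).

```go
package main

import "fmt"

// Envelope stands in for a wrapped transaction.
type Envelope struct{ Payload string }

// BatchCutter covers sizing (2) and appending (4) only; validation (1) and
// system-tx filtering (3) would live behind separate interfaces.
type BatchCutter interface {
	// Append adds a message; it returns a finished batch when full.
	Append(env Envelope) ([]Envelope, bool)
	// Cut forces out whatever is pending (e.g. on a batch timer).
	Cut() []Envelope
}

// countCutter cuts a batch after max messages.
type countCutter struct {
	max     int
	pending []Envelope
}

func (c *countCutter) Append(env Envelope) ([]Envelope, bool) {
	c.pending = append(c.pending, env)
	if len(c.pending) >= c.max {
		return c.Cut(), true
	}
	return nil, false
}

func (c *countCutter) Cut() []Envelope {
	batch := c.pending
	c.pending = nil
	return batch
}

func main() {
	var bc BatchCutter = &countCutter{max: 2}
	bc.Append(Envelope{"tx1"})
	batch, cut := bc.Append(Envelope{"tx2"})
	fmt.Println(cut, len(batch)) // true 2
}
```

Separating the concerns this way would let a consensus plugin consume only the pieces it needs, instead of pulling the whole convoluted interface into its code.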

jyellick
2017-01-25 20:36
We can certainly split this into multiple pieces if that is useful

jyellick
2017-01-25 20:36
The interface makes the implementation of solo trivial

jyellick
2017-01-25 20:37
And I think it is fairly straightforward in Kafka as well. I did my best to make it usable by SBFT, but maybe it needs to be broken into pieces

vukolic
2017-01-25 20:37
it's just that I never saw the need for things like this to bring 200 fresh lines of code to SBFT

vukolic
2017-01-25 20:37
anyway - we have it merged now so let's work with what we have

vukolic
2017-01-25 20:38
btw, any hint on that strange renaming issue

vukolic
2017-01-25 20:38
I am not sure why an app would stop renaming at the orderer level

jyellick
2017-01-25 20:44
I'm not sure, I am not a big IDE fan, I stick to hacking in vim (so typically just do such things manually)

vukolic
2017-01-25 20:46
would do it that way but the thing is all over the place

vukolic
2017-01-25 21:09
no this Committer is impossible to rename

vukolic
2017-01-25 21:10
i will rename it in sbft scope only

vukolic
2017-01-25 21:23
@jyellick @hgabor @kostas @binhn I would kindly ask that future sbft merges wait for my code review

vukolic
2017-01-25 21:23
thanks in advance

vukolic
2017-01-25 21:41
@jyellick given two blocks at number 25 and 54

vukolic
2017-01-25 21:42
but without blocks in between

vukolic
2017-01-25 21:42
is there a way to find out, from Committers in 25 and Committers in 54,

vukolic
2017-01-25 21:42
were there Committers in blocks 26-53

vukolic
2017-01-25 21:42
?

vukolic
2017-01-25 21:44
in other words, do Committers have sequential numbers?

jyellick
2017-01-25 21:49
Yes

jyellick
2017-01-25 21:50
There is a `LastConfigurationIndex` field in the block metadata, which indicates the block number of the last configuration transaction

jyellick
2017-01-25 21:51
It by default is only signed per orderer

jyellick
2017-01-25 21:51
But you could add as many signatures as you like

vukolic
2017-01-25 21:51
so committers = configuration tx? :slightly_smiling_face:

jyellick
2017-01-25 21:51
So, from the outside world, this is largely correct

jyellick
2017-01-25 21:51
Internally, we play some odd games with translating tx types

vukolic
2017-01-25 21:51
I mean, LastConfigurationIndex counts committers, right?

jyellick
2017-01-25 21:53
`LastConfigurationIndex` is the last block number which had a committer, for normal chains. there are some edge cases (that I'm hoping will go away) to make the 'ordering system chain' more like the rest, but, that is WIP

vukolic
2017-01-25 21:53
so, you see why I am asking

vukolic
2017-01-25 21:54
because orderers have this state independent from ordinary txs

vukolic
2017-01-25 21:54
and configuration txs should be rare

vukolic
2017-01-25 21:54
this immensely simplifies state transfer at (sbft) orderers

vukolic
2017-01-25 21:54
which means if I have a gap from 26-53, I could look at LastConfigIndex(54)

vukolic
2017-01-25 21:55
and if LastConfigIndex(54) <=25

vukolic
2017-01-25 21:55
I am good to proceed with the ordering without filling in the gap

vukolic
2017-01-25 21:55
agree?

vukolic
2017-01-25 21:56
and also if X := LastConfigIndex(54)>25 then the orderer should just fetch block X

vukolic
2017-01-25 21:56
and repeat that while LastConfigIndex(X)>25
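That loop can be sketched as follows (hypothetical helper names; lastConfigIndex stands in for reading a block's LastConfigurationIndex metadata, and the sketch assumes a config block's own metadata points at itself, hence the x-1 step).

```go
package main

import "fmt"

// missingConfigBlocks walks the LastConfigIndex pointers backwards from the
// latest block and collects the config blocks inside the gap, i.e. those
// after the highest block we already hold. An empty result means we can
// proceed with ordering without filling in the gap.
func missingConfigBlocks(have, latest uint64, lastConfigIndex func(uint64) uint64) []uint64 {
	var fetch []uint64
	for x := lastConfigIndex(latest); x > have; x = lastConfigIndex(x - 1) {
		fetch = append(fetch, x)
	}
	return fetch
}

// demoLCI is a hypothetical chain whose config txs landed in blocks 10, 30
// and 40: for any block n it returns the last config block at or before n.
func demoLCI(n uint64) uint64 {
	switch {
	case n >= 40:
		return 40
	case n >= 30:
		return 30
	default:
		return 10
	}
}

func main() {
	// Holding block 25 and seeing block 54: fetch config blocks 40 and 30.
	fmt.Println(missingConfigBlocks(25, 54, demoLCI)) // [40 30]
	// If lastConfigIndex(54) were <= 25, the result would be empty and the
	// orderer could proceed without filling the gap.
}
```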

jyellick
2017-01-25 22:13
Yes, I absolutely understand

jyellick
2017-01-25 22:13
And agree

jyellick
2017-01-25 22:14
I would note however, that I would vote to never `Deliver` a block unless the orderer has already processed all previous blocks

vukolic
2017-01-25 22:15
orderer might not deliver but we call Deliver from sbft process

vukolic
2017-01-25 22:15
because fill in the gap should be independent from sbft

jyellick
2017-01-25 22:15
Yes, agreed

vukolic
2017-01-25 22:15
if there is xyzbft it would face the similar problem

vukolic
2017-01-25 22:15
and would have to re-solve it

vukolic
2017-01-25 22:15
so the idea is to solve it at the current sbft backend level

vukolic
2017-01-25 22:15
not within simplebft itself

vukolic
2017-01-25 22:16
as a thread separate from main simplebft thread

jyellick
2017-01-25 22:20
Makes sense

simon
2017-01-26 00:25
what was 200 lines?

vukolic
2017-01-26 00:42
@simon new way of batching

tom.appleyard
2017-01-26 11:10
odd question but how big can a hyperledger network get before there is a noticeable drop in performance due to the number of nodes that require consensus?

tom.appleyard
2017-01-26 11:11
(I guess this would be with 0.6 PBFT, or with the new consensus setup if we have predictions for that)

jyellick
2017-01-26 14:38
@tom.appleyard I think @bcbrock is probably the best one to answer your question

simon
2017-01-26 17:40
if you want feedback, just tag me here and I'll have a look

mcoblenz
2017-01-26 19:44
What should a newcomer read in order to understand how consensus works on Fabric?

cca
2017-01-26 20:21
@mcoblenz - design docs are linked from here: https://wiki.hyperledger.org/community/fabric-design-docs

cca
2017-01-26 20:21
BFT consensus, specifically here: https://jira.hyperledger.org/browse/FAB-378

mcoblenz
2017-01-26 20:21
Great, thanks. Those are relatively up to date?

cca
2017-01-26 20:22
yes, for V1

mcoblenz
2017-01-26 20:22
ok, thanks! I’ll read through those.

cca
2017-01-26 20:23
For in-depth learning on BFT protocols including PBFT, see a textbook like this one : http://www.distributedprogramming.net :wink:

bcbrock
2017-01-26 20:54
@tom.appleyard Unfortunately I haven’t done that kind of scaling study.

tom.appleyard
2017-01-27 09:08
@bcbrock ah unfortunate, no worries though - was just curious to see if such a thing had been done

kostas
2017-01-27 15:54
@archivebot I summon thee

archivebot
2017-01-27 15:54
has joined #fabric-consensus-dev


weeds
2017-01-28 10:10
@tom.appleyard I think there is going to be a huge difference between what is in version 0.6 and what is being developed in master. I don't know which you were asking about, though. For 0.6, I know we got up to at least 15+ nodes without any noticeable performance degradation. And I know we have customers in production with 0.6 well over 300,000 blocks with at least 1000 transactions each running at this point in time. I know there are others that might have gone above these numbers, but this is what I remember offhand

weeds
2017-01-28 10:10
For version 1.0, an event occurred in december with a few companies where we connected 12+ nodes that were located in different parts of the country, running chaincode without issue... Although we did not measure performance.

weeds
2017-01-28 10:11
Performance is always a tricky thing though: I could see variations in terms of the chaincode that has been written, the number of nodes, etc.

tom.appleyard
2017-01-28 12:33
@weeds Info on both 0.6 and 1.0 is useful - I was asked this yesterday and didn’t know so thought I’d check. On what you did say though - fantastic, when you say ‘noticeable performance degradation’, how are you measuring performance and by how much is ‘noticeable’ - how does the degradation increase as more are added?

tom.appleyard
2017-01-28 12:33
thanks!

dave.enyeart
2017-01-28 18:27
@jyellick @kostas It appears Orderer doesn’t cut a block for 10 seconds after it receives a transaction. This gets a little annoying during iterative end-to-end test. How do I configure it down to 1s or 2s? And should we lower the default?

sanchezl
2017-01-28 18:28
dave.enyeart: There is also a MaxMessageCount setting. Send enough messages to cut a block immediately. You can set this all the way down to 1.

dave.enyeart
2017-01-28 18:29
I changed MaxMessageCount to 1 in orderer.yaml, didn't see a difference. Maybe I need to recreate my channel though.

sanchezl
2017-01-28 18:32
Yes, either a new chain with the defaults in the order.yaml changed, or send a config transaction to the chain that alters its settings.

garisingh
2017-01-28 18:32
recreate sounds easier :wink:

dave.enyeart
2017-01-28 18:33
I re-created the chain with my updated orderer.yaml and it still doesn't cut a block for 10s

dave.enyeart
2017-01-28 18:37
by re-create, i mean `peer channel create -c myc1`

sanchezl
2017-01-28 18:37
where `myc1` is a new one?

dave.enyeart
2017-01-28 18:37
i deleted the old myc1

dave.enyeart
2017-01-28 18:37
let me try myc2

dave.enyeart
2017-01-28 18:41
same thing with the new myc2… still a 10s delay

sanchezl
2017-01-28 18:42
There is this file, `common/configtx/test/orderer.template`, that I’m not 100% sure how it’s being used, but that contains the actual default values used I think.

sanchezl
2017-01-28 18:44
There is supposed to be some tooling in `orderer/tools/configtemplate/` to regenerate it. maybe @muralisr or @binhn can provide some more insight.

dave.enyeart
2017-01-28 18:46
I edited orderer.template by hand, and it broke `peer channel create`

sanchezl
2017-01-28 18:46
it’s a binary file.

dave.enyeart
2017-01-28 18:46
will live with the 10s for now

muralisr
2017-01-28 18:48
```
Genesis:
    # Orderer Type: The orderer implementation to start
    # Available types are "solo" and "kafka"
    OrdererType: solo

    # Batch Timeout: The amount of time to wait before creating a batch
    BatchTimeout: 10s

    # Batch Size: Controls the number of messages batched into a block
    BatchSize:

        # Max Message Count: The maximum number of messages to permit in a batch
        MaxMessageCount: 10
```
setting MaxMessageCount: 1 in orderer.yaml doesn’t help?

muralisr
2017-01-28 18:49
or BatchTimeout: 1s

dave.enyeart
2017-01-28 18:49
nope, tried both

sanchezl
2017-01-28 18:49
@muralisr, would he have to re-generate the template?
```
Firehawk:hyperledger sanchezl$ cd fabric/orderer/tools/
Firehawk:tools sanchezl$ ls -l
total 0
drwxr-xr-x  3 sanchezl  staff  102 Jan 24 10:00 configtemplate
Firehawk:tools sanchezl$ cd configtemplate/
Firehawk:configtemplate sanchezl$ ls -l
total 8
-rw-r--r--  1 sanchezl  staff  1697 Jan 24 10:00 main.go
Firehawk:configtemplate sanchezl$ go build
Firehawk:configtemplate sanchezl$ ls -l
total 27536
-rwxr-xr-x  1 sanchezl  staff  14092908 Jan 28 13:47 configtemplate
-rw-r--r--  1 sanchezl  staff      1697 Jan 24 10:00 main.go
Firehawk:configtemplate sanchezl$ ./configtemplate --help
Usage of ./configtemplate:
  -outputFile string
        The file to write the configuration templatee to (default "orderer.template")
```

muralisr
2017-01-28 18:50
ok

muralisr
2017-01-28 18:52
if you do that and `cp orderer.template $GOPATH/src/github.com/hyperledger/fabric/common/configtx/test` I guess?

muralisr
2017-01-28 18:53
if it's now baked into the template, I suppose we have to do that

sanchezl
2017-01-28 18:54
I think the baking-in was a temporary move, at least that’s what the change set says.

muralisr
2017-01-28 18:56
ok

dave.enyeart
2017-01-28 18:57
that worked

dave.enyeart
2017-01-28 18:57
spent 30 minutes but saved 9s :slightly_smiling_face:

dave.enyeart
2017-01-28 18:59
ok, figured it out in the side thread. But we might want to change the default to 1s or 2s to make end-to-end iterative trials more pleasant for people

muralisr
2017-01-28 19:14
:slightly_smiling_face:

dave.enyeart
2017-01-28 19:23
opened a jira item for orderer config updates (at least in trial environments): https://jira.hyperledger.org/browse/FAB-1919

stchrysa
2017-01-29 13:11
@stchrysa has left the channel

sagmeister
2017-01-30 14:50
has joined #fabric-consensus-dev


hl.rose
2017-01-30 23:55
has joined #fabric-consensus-dev

passkit
2017-01-31 03:13
has joined #fabric-consensus-dev

shawn
2017-01-31 06:29
has joined #fabric-consensus-dev

eragnoli
2017-01-31 09:22
has joined #fabric-consensus-dev

karkal72
2017-02-02 00:06
has joined #fabric-consensus-dev

hgabor
2017-02-02 14:01
guys please review my commits

hgabor
2017-02-02 14:01

jyellick
2017-02-02 16:07
@hgabor Please add me as a reviewer to any commits you'd like me to take a look at, it is much more likely to get my attention that way

hgabor
2017-02-02 16:09
okie I will

hgabor
2017-02-02 16:09
sorry :disappointed:

jyellick
2017-02-02 16:14
No harm to me! Just don't want your CRs to languish

weeds
2017-02-02 23:41
hi everybody, in case you have not heard, linux foundation is moving us off of Slack and onto Rocket.Chat, everybody in Slack can log in to the new chat server using your existing linux foundation ID, please visit http://chat.hyperledger.org/ to login to the new chat server

jimyang
2017-02-03 02:44
has joined #fabric-consensus-dev

beauson45
2017-02-03 04:59
has joined #fabric-consensus-dev

hgabor
2017-02-03 12:08

scottz
2017-02-03 15:58
@jyellick @kostas please, who can we assign this issue to? https://jira.hyperledger.org/browse/FAB-2001

jyellick
2017-02-03 15:59
Feel free to assign it to me

kostas
2017-02-03 16:03
(I am here to help if need be.)

kostas
2017-02-03 16:05
By the way, these will land on our radar automatically if filed under the `fabric-consensus` component in JIRA.

scottz
2017-02-03 18:10
thanks

jzhang
2017-02-03 19:54
@jzhang has left the channel

ry
2017-02-07 18:36
has joined #fabric-consensus-dev

ry
2017-02-07 18:36

ry
2017-02-07 18:36
@ry archived the channel (w/ 279 members)